software engineer, machine translation risk models
My current main research interests are natural language processing and especially machine translation. In practice, achieving anything interesting means applying machine learning to human language. Concretely, I am interested not so much in improving average accuracy as in achieving common sense and robustness in natural language systems nearer to the level observable in multilingual humans.
Beyond natural language processing and artificial intelligence, my interests in software include search (information retrieval), transliteration and internationalization/internationalisation/i18n/l10n (ISO, international TLDs, Unicode, input methods...), security topics for privacy and freedom, finance and markets, programming languages, entrepreneurship and investing.
deepchar Transliteration with sequence-to-sequence models and transfer learning
github.com/fasttext Unofficial libraries for fastText
language Basic tools for working with natural language text data
NLP Guide Open guide to natural language processing
deeplanguageclass Class on machine learning for natural language processing
NoiseMix Data generation for natural language
fastent Custom models for named-entity recognition
Since 2013 I have been active as a technical founder. From 2007 to 2013 I was a Software Engineer at Google in Mountain View, California. At Google I initially joined the development of a proprietary business data storage platform and Android / Google Play, helped add Arabic transliteration to the Google Language API, and finally joined the Google Translate team.
On Translate my projects included engineering ownership of the then-launching Translation Manager for custom translations, the translation API integrated by the Chromium project (i.e. the Google Chrome browser), and experiments with automatic query correction to improve perceived translation quality. I was also happy to participate as a technical linguist and polyglot in efforts around language identification, error/quality analyses, query/index normalisation and new language launches.
While at Google I also had the pleasure of completing Stanford University’s CS121 Introduction to Artificial Intelligence and CS276 Information Retrieval and Web Search, taught by the authors of the book themselves, Prof. Chris Manning and then-Head of Yahoo Research Prabhakar Raghavan.
Before Google I interned as a software engineer at Adobe Systems and Cerner, while on my way to earning a B.Sc. in Computer Science from the University of Washington in Seattle. My final year’s projects there included writing a Bayesian spam classifier in Python, implementing a Java compiler in Java, image recognition in C, running GIZA++, and a now very quaint presentation on using Wikipedia to train machine translation systems and learn named entities.
To support the global developer community I have spoken on varied programming topics at developer events in Eastern Europe, Latin America, the Middle East and North Africa.
German, English, Spanish, Italian, French, Russian and the language now known as Serbian, Croatian, Bosnian and Montenegrin are among the less painful ways to communicate with me. I am one of the younger people with expert knowledge of the microlanguage Danube/Banat Swabian, and understand with reduced difficulty the related Alemannic, Palatine and Yiddish.
On this foundation comes a good understanding of the mechanics of the Balkan sprachbund and the ability to read Hungarian aloud with a good accent without any idea of the meaning. I am currently learning Armenian while trying not to forget the other languages. Further interests include Bulgarian, Romanian, Portuguese, Turkic languages, Arabic and Persian. I claim some knowledge of all living Germanic, Slavic and Romance languages. Regarding writing systems, I can read the Latin and Cyrillic scripts, and the Armenian and Greek scripts are not lost on me.
I wish I knew more Greek, Hebrew and Romani.
The German language is easy, as the teacher likes to say right in the first lesson. Here is an example:
Let us take a book in German about the habits and customs of the natives of Australia, the Hottentots (in German, Hottentoten). The book tells us that kangaroos (Beltelratten) are captured and put in cages (Kotter), covered with a lattice (Lattengitter) to protect them from the weather. If such a lattice-covered cage (Lattengitterkotter) holds a kangaroo, we call the whole thing a "lattice-covered cage with kangaroo". That is, Lattengitterkotterbeltelratten.
One day, the Hottentots arrested a murderer (Attentaeter), accused of having killed a Hottentot mother (Mutter, so Hottentotenmutter), who has a deaf-mute son (Stottertrottel).
We call this woman Hottentotenstottertrottelmutter and her murderer Hottentotenstottertrottelmutterattentaeter.
In the book, the natives captured him and, having nowhere to put him, put him in a kangaroo cage (Beltelrattenlattengitterkotter). Then, incidentally, the prisoner escapes. Once the search has begun, a Hottentot warrior comes shouting:
As the example shows, the German language is very, very easy!
[Note: Those poor lads actually did not particularly push the theory of linguistic relativity.]
We must read it again and again, even in higher-brow publications like The Economist, how a language's words or categories or even grammar have an effect on the way its speakers think. I will be the first to agree that one has a different personality in different languages. But but but...
Let us consider the case of grammatical gender. What unites Turkish, Armenian, Persian, Tajik, Kazakh, Uzbek, Uyghur, Tatar, Kyrgyz, Malay, Indonesian, Finnish, Hungarian, Chinese and many other languages in those and other families? They have a genderless pronoun that represents both he and she (and him and her and so on). And what is the result? Or the cause? And if a language has no word for pseudoscience, what then?
Somehow this line of thinking did not become impolite along with other unscientific 19th century inventions about cultures.
It goes without saying that having learnt a language is one of the most satisfying things one can do, but it is not always easy. In 2009 or so, after questions from different friends about strategies for language learning, I compiled a single document on the topic.
Below are some motivations and techniques for effective language learning, which of course vary in details among learners and languages. [Project idea: a system that answers the question "How hard is language x?" in a personalised fashion, based on languages the user knows or, even better, based on the results of a quick test.]
I myself should use these techniques with more discipline; lately that has not been the case.
There must be motivations along the way. Correspondence (for example, a weekly email) is the best way, and it is fine if it is very basic. Ideally it forces you to express yourself in terms of what you have done, what you are doing and what you will do. Writing has the advantage that you can take the time to look up how to say things you want to say. If you need to say them often, you will look them up often enough to learn them. If you do have the chance for oral correspondence, it is best with the very young or old, or in general those who do not have any way to communicate with you but the language you wish to learn. The greatest advantage of written correspondence or oral correspondence as described is that you will not be self-conscious. Other motivations can be translating songs, or watching films. With films it is actually effective to watch a film you already know very well and like, so that you do not get too lost, and have context. The same can be said for using software you know in the new language, reading news headlines or articles about events you already know, Wikipedia articles (for example, on the country or state in which you live), and so on. (Watching films with subtitles can be good, but it requires a sort of dual-processor concentration that can be difficult to maintain. You will notice that I am not necessarily advocating watching films originally produced in the language you want to learn, and so on.)
The single most important technique is to keep a list of things you would like to learn or need to learn. I prefer paper, because copying in this case is actually helpful. The list should have the target language on one side, and whatever helps you understand on the other. (For example, "Keif 7alek?" can be explained as "How (is) your feeling?", or followed by an entry for "7al" (feeling).) Whenever you encounter something interesting or funny to you, add it to the list. Whenever you find yourself repeatedly grasping for a phrase, add it to the list. Add words from your favourite subjects - you will need to be talking about these things in any case.
target language                          | translation / notes
---------------------------------------- | ----------------------------------------
Come stai?                               | like "¿Como estas?" (How are you?) - informal
Bene bene                                | well
troppo                                   | too, too much
in ritardo                               | late
la gioventu' di oggi                     | "Kids these days" (lit: the youth of today)
l'Italia di oggi                         | "today's Italy" (lit: the Italy of today)
il pullman / l'autobus                   | the bus
Il pullman sta in ritardo                | The bus is late.
la fermata pullman                       | bus stop (lit: the stop (of the) bus)
fermare                                  | to stop
la frenata                               | the braking / the hard stop
frenare                                  | to brake
io freno                                 | I brake
Ho frenato troppo in ritardo!            | I braked too late!
Il pullman ha frenato troppo in ritardo! | The bus braked too late!
This is a very condensed summary - your list should be about 10 pages of this - but clearly, the phrases are related in a way. They are organised around a few concepts - the feminine past participle can be used to make a noun, uses of "di oggi" and small but complete sentences that can be said with gusto. This organisation can happen fairly naturally. Whenever you find a phrase, it's a good idea to look up its component words at that point to see if they have other major uses, and also try to find out how to say synonyms or related things. (For example: in Arabic, "maktab" (office) and "maktabah" (library) should be near "kitab" (book), because they share a root. For German, "Zahnarzt" (dentist) should be near "Zahn" (tooth), and putting "Tierarzt" (veterinarian) near is nice too.) This will provide for later epiphanies. A tool like https://dict.leo.org/itde?lp=itde&search=vendita - or just Google - is great for this, and writing the most important phrases using the word is a very good idea.
Do include phrases with words that are the same in both languages. Such words will anchor the phrase, so that you learn it well and can swap in other words. Also include phrases that seem to fit into a larger conversation: "How could you?!" "I'll show that bastard!" Films should provide the most idiomatic translations, and in fact it's most important that it's idiomatic in the language you are learning. For languages where the nouns change according to the case, it's important to bake phrases into your head with a word in different positions.
So what should you do with this list? Cover up one side and recite; if you are learning the alphabet, write! Ideally a friend will enjoy quizzing you by reading down the list, and you will answer orally. At times I have done this about an hour each day, for months. It is helpful to go from either column, but mostly you should go into the target language. As the first pages become so easy as to be boring, leave them. This will keep the total pages you must do roughly constant.
From a theoretical computer-science perspective, the idea here is to build a cache of n-grams. The facts are that most everyday language is composed of amazingly few words and phrases, and that a computer program linking overlapping n-grams will generate text that seems amazingly realistic, and is usually grammatically correct if semantically nonsensical. (This is a common school project.) The exact contents of the cache will depend on the level of the learner. A beginner will need many basic building blocks of language like "there is", question words, conjugations and so on. As these become natural they can sort of naturally be left out or only included as components of more complicated constructions.
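The school project mentioned above can be sketched in a few lines of Python. This is only a minimal illustration, not a description of any particular tool: the function names `build_ngram_cache` and `generate` are made up for this example, the "cache" is a plain dictionary from a prefix of n-1 words to the words seen after it, and the tiny corpus is assembled from phrases in the list above.

```python
import random
from collections import defaultdict


def build_ngram_cache(text, n=2):
    """Map each (n-1)-word prefix to the list of words seen right after it.

    Assumes n >= 2 and whitespace-separated words (a simplification).
    """
    words = text.split()
    cache = defaultdict(list)
    for i in range(len(words) - n + 1):
        prefix = tuple(words[i:i + n - 1])
        cache[prefix].append(words[i + n - 1])
    return cache


def generate(cache, start, length=10, seed=None):
    """Chain overlapping n-grams, starting from the prefix tuple `start`."""
    rng = random.Random(seed)
    out = list(start)
    k = len(start)
    while len(out) < length:
        followers = cache.get(tuple(out[-k:]))
        if not followers:
            break  # dead end: this prefix was never followed by anything
        out.append(rng.choice(followers))
    return " ".join(out)


# A toy corpus built from phrases in the list above.
corpus = (
    "il pullman ha frenato troppo in ritardo "
    "il pullman sta in ritardo "
    "ho frenato troppo in ritardo"
)
cache = build_ngram_cache(corpus, n=2)
print(generate(cache, ("il",), length=8))
```

With a corpus this small the output simply recombines the few phrases it has seen, which is roughly the point: a small, well-chosen cache already covers a surprising share of what one needs to say.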
When you pronounce the words, take pride, say them loudly and with your best accent. However you learn them now is likely how you will speak them forever. It is my observation that, because of the difficulty of unlearning, the best accent will develop if there is significant exposure before any learning (of vocabulary or grammar) begins. Tolerating streams of babel is not necessarily fun, but since it leaves one to ponder only the sounds and especially the rhythm of a language, it provides an excellent foundation. Rhythm is actually by far the most important part of pronunciation; for this reason non-native speakers can generally shed their accent when singing. I recognise that putting full effort into an accent is not something that comes naturally to all people, purely from a psychological perspective; self-consciousness is a great hindrance here. Corresponding with humble people with no other way of communicating is one solution, moderate amounts of alcohol are another.
Most of the concepts for spoken learning apply for written learning too.
Maintaining a cache is again a technique. Permanent commitment to memory is in some sense achieved when the word can be read without being sounded out. It is possible to delude yourself if you use a small set, but reading without sounding out is not really cheating; remember that native or advanced readers do not actually read each letter anyway, but rather just rely especially on the first and last letter of the word (in scripts or alphabets - Chinese is a different story). (Even the shape of a word matters: https://www.microsoft.com/typography/ctfonts/wordrecognition.aspx.) Another milestone is the ability to write the word. If you can write it, you can use all the letters in it when you need to write other words.
As with oral learning, you should not try to learn too much at once. Learning a new writing system is difficult, but it is much easier if the words you are reading or writing are familiar. Write loan words (like the translation for "telephone" in most languages), city names, names of famous people from the region, etc. (If you are really bored, you can use the new writing system to express a language you know.) If you are lucky, the language you are learning has been written in multiple writing systems. For example, most languages in Central Asia have been written in Arabic, Cyrillic and Latin, and Serbo-Croatian is officially written in both Latin and Cyrillic.
If the alphabets are related - and most are - begin with the few words that contain mostly very similar characters. (For example, "Апотека" is Cyrillic for "Apoteka" (pharmacy in some languages) - only a single letter is really different.) Also, remember that Cyrillic letters like п are practically identical to Greek letters used in maths class, and stand for the same sounds; for example, pi, chi and rho correspond to p, k and r.
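One way to pick those first words is to count how many letters in each word are not already familiar. The Python sketch below does this for Cyrillic read by someone who knows the Latin alphabet. The `LOOKALIKES` table is my own rough assumption about which letters look and sound close enough, and the function names are hypothetical. The letters are written as Unicode escapes because Latin and Cyrillic lookalikes are indistinguishable in source code.

```python
# Cyrillic letters assumed to look AND sound roughly like a Latin letter.
# This table is an illustrative assumption, not a linguistic standard.
LOOKALIKES = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043a": "k",  # Cyrillic ka
    "\u043c": "m",  # Cyrillic em
    "\u043e": "o",  # Cyrillic o
    "\u0442": "t",  # Cyrillic te
    "\u0458": "j",  # Cyrillic je (used in Serbian)
}


def unfamiliar_letters(word):
    """Return the letters of `word` that have no entry in LOOKALIKES."""
    return [ch for ch in word.lower() if ch.isalpha() and ch not in LOOKALIKES]


def sort_by_familiarity(words):
    """Order words so those with the smallest share of new letters come first."""
    return sorted(
        words,
        key=lambda w: len(unfamiliar_letters(w)) / max(len(w), 1),
    )


print(unfamiliar_letters("Апотека"))
```

Sorting a word list with `sort_by_familiarity` gives a reading order in which each new word introduces as few new letters as possible.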
In even vaguely phonetic systems, saying the word aloud may help you recognise it, if it is similar to other languages or a word you have heard. In any case, it is always beneficial to read aloud as you practice writing.
In all these recommendations there are a few common concepts. Firstly, in general it makes progress faster to avoid learning multiple new things in one exercise. Secondly, it is important that you prioritise what you learn, by adding to the list whatever arises. Next, immersing yourself is more important than maintaining exact mappings of translations, because the goal is to not be always translating. Moreover, having (the oral or written form of) random but not atypical words and phrases solidly committed allows neighbouring phrases (in terms of, say, edit distance) to be quickly derived or recognised when needed. Finally, motivations, like consuming film, music or software you already enjoy or pursuing human interaction, naturally make you learn easily.
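To make "neighbouring phrases in terms of edit distance" concrete, here is a small Python sketch of the standard Levenshtein distance, which counts insertions, deletions and substitutions. It works on any sequence, so it can compare words letter by letter or phrases word by word. The helper `nearest_phrases` is a hypothetical name for this illustration, and a human memory of course does not literally work this way.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # delete item_a
                curr[j - 1] + 1,     # insert item_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def nearest_phrases(query, memorised, k=3):
    """Return the k memorised phrases closest to `query`, compared word by word."""
    q = query.split()
    return sorted(memorised, key=lambda p: edit_distance(q, p.split()))[:k]


memorised = [
    "Ho frenato troppo in ritardo!",
    "Il pullman sta in ritardo",
    "Come stai?",
]
print(nearest_phrases("Il pullman ha frenato troppo in ritardo!", memorised, k=1))
```

A phrase that is only a few word edits away from something already committed to memory is cheap to derive, which is why a small but varied list goes a long way.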
Please do not contact me or anybody with irrelevant job offers or outsourcing offers
You can reach me in the language of your choice via x ꙘȚ bittlingmayer Đ0Ț org
ЏИĐ€Я $ȚꙘИĐ ȚНꙆꙆꙆ$ ?_ Then you’re not a machine. :-)
If that was easy for you, try this one:
ајне ајнфахе вершлисселунг