Analyzing the Results of the Elections for the 21st Knesset with Deep Learning

Much has been said about the many uses of Deep Learning and the breakthroughs it has enabled in text analysis, image recognition, speech understanding and more. These applications have driven, and will keep driving, enormous changes in many areas of our lives: autonomous cars, personal-assistant bots, computerized analysis of medical imaging, text translation and what not.

Training and inference

Broadly speaking, the way DL is used in all of the above cases is similar, and consists of two phases:
1. Training a neural network on labeled data, producing a model that maps the input to the labeled output.
2. Running inference on new input and classifying it against one of the known outputs (or approximating a numeric value - regression).
 

For example, if we want to build a network that distinguishes between dog breeds, we need to feed the network many images of different dogs and 'explain' to it, for each dog, which breed it belongs to. We then set the resulting model aside, and can use it to infer, for new dog images, which breed they belong to.

Using DL for data analysis

Can this technology be used for purposes other than predictions and classifications?
Can we learn new things we did not know by using deep learning?

In a previous post, for example, I wrote about the value of vector representations of words for detecting similarity between words, for various uses in natural language processing (NLP), using the Word2Vec algorithm. It is interesting to think about it this way: the vector representation of the words is a by-product of a neural network, not its classic output. And yet, that by-product is the thing that really matters.
 
This time, then, I chose to focus on a slightly different angle: new methods for analyzing data as by-products of a neural network.
For this analysis we will use the advanced tools made possible by the progress in deep learning, among them Keras on top of Tensorflow, and the visualization libraries matplotlib and networkx.
 
Let's get started, fasten your seat belts..

A deep architecture meets the elections

This post is being written a little over a month after the elections for the 21st Knesset in Israel (it does take time to process the data alongside a day job and family time), and the data sources are the polling-station-level results of the elections for the 20th and 21st Knesset.
 
For this post I built a neural network that maps the results of the two elections at the level of the individual polling station - that is, a network that aims to recognize the voting pattern of a polling station in the 20th Knesset elections and use it to predict the results of that same station in the 21st Knesset elections.
 
The obvious question arises: why build a network that maps between two sets of election results when the data is already known and published?
The answer is that my goal here is not the inference process itself, but using the resulting model to try and learn something about the elections, hopefully something we did not know before (and also because it's fun :).
 
Some numbers
 
In the elections for the 20th Knesset: 26 parties ran, and Israel's citizens voted in 10,412 polling stations.
In the elections for the 21st Knesset: 43 parties ran, and Israel's citizens voted in 10,765 polling stations.
Out of those 10,000+ polling stations, 8,565 participated in both elections.
 
As mentioned, we will now build a neural network with a fairly simple goal: mapping the results of each of the 8,565 polling stations in the 20th Knesset elections to the results of the same stations in the 21st Knesset elections.
That is, the network has an input layer of 26 neurons (the number of parties that ran for the 20th Knesset) and an output layer of 43 neurons (the number of parties that ran for the 21st Knesset).
Between these two layers we place 4 fully connected (dense) hidden layers with 35 neurons each. The network, then, looks like this:
[Figure: network architecture - 26 inputs, 4 dense hidden layers of 35 neurons each, 43 outputs]
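Here is a minimal sketch of how such a network could be defined in Keras. The layer sizes follow the description above; the activation functions, optimizer and loss are my own assumptions, since the actual code is only linked at the end of the post:

```python
from keras.models import Sequential
from keras.layers import Dense

# 26 parties in the 20th Knesset -> 43 parties in the 21st Knesset
model = Sequential()
model.add(Dense(35, activation='relu', input_dim=26))  # first hidden layer
for _ in range(3):                                     # three more hidden layers
    model.add(Dense(35, activation='relu'))
model.add(Dense(43, activation='softmax'))             # output: vote share per party

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
```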
Why 4 layers? Why 35 neurons in the hidden layers?
In most Machine Learning algorithms, the number of parameters that can affect the model's behavior is fairly small, and for the most part it does not substantially affect the structure of the data used to build the model.
 
In DL, by contrast, the developer carries far more responsibility. In addition to the classic parameters such as the learning rate, the number of iterations (epochs) and others, the developer has to actually define the architecture of the learning network: what the input and output layers look like, how many intermediate layers there are and of which types, what the activation functions between the stages are, what the loss function is, and so on. The number of possible permutations for building a network is enormous.
Experts will tell you about this or that rule of thumb, but in practice, trial and error is your best friend.
So why 4 layers with 35 neurons each? Because it works!
 
The input and output layers
As mentioned, the input and output layers represent the election results in each polling station for the 20th and 21st Knesset respectively.
In practice these are vectors of length 26 and 43, where each cell holds a number representing the votes cast in that station for the corresponding party. Since the number of voters differs between stations, we normalize the results to 100%: each cell holds the share of votes that party received out of all valid votes in that station.
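For illustration, the normalization step could look roughly like this (the raw count arrays below are randomly generated stand-ins for the official per-ballot result files):

```python
import numpy as np

# Hypothetical raw vote counts per station; in the real post these come from the
# official per-ballot CSV files of the 20th and 21st Knesset elections.
raw_20 = np.random.randint(0, 300, size=(8565, 26)).astype(float)
raw_21 = np.random.randint(0, 300, size=(8565, 43)).astype(float)

# Normalize each station's row to vote shares (each row sums to 1, i.e. 100%)
X = raw_20 / raw_20.sum(axis=1, keepdims=True)
y = raw_21 / raw_21.sum(axis=1, keepdims=True)
```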
 

The fit process and first results

We now have a data structure representing the mapping between the results of each polling station in the 20th and 21st Knesset elections.
All that is left to do in order to train the model is to 'pass' the raw data through this structure, a process called fitting.
We will run 100 iterations (again, trial and error) and also measure the model's quality.
 
Quality check
Since the problem we are tackling is a supervised learning problem - the polling station results are in fact known - it is easy to measure the quality of the model we built.
The common approach is to split the data into two parts, usually of unequal size: a training set and a test set. We build the model on the training set and check its accuracy against the test set.
After 100 iterations we reached 89% accuracy - not bad for the time invested in this post, but it can probably be improved.
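A sketch of the split-and-fit step, continuing the snippets above (the 80/20 split ratio and batch size are my own guesses; the post only fixes the 100 epochs):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=1)
loss, acc = model.evaluate(X_test, y_test)
print('test accuracy: %.2f' % acc)
```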
 
Let's examine a few examples
We draw 4 random polling stations and compare the actual vote with the prediction; a clear resemblance is visible in most cases.
[Figure: ballot1 - four random polling stations, actual results vs. predictions]
The cosine similarity value describes the degree of similarity between the actual values and the prediction, and it ranges between 0 and 1.
(More formally, this value is the cosine of the angle between the two vectors - the actual results and the prediction.)
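Computing that similarity for a single polling station is a one-liner; a minimal version:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between the actual and predicted vote-share vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

y_pred = model.predict(X_test)
print(cosine_similarity(y_test[0], y_pred[0]))
```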
 
 
But the truth is that not everything is rosy. Let's now look for the predictions that are farthest from the actual results - here, the cosine similarity values are very (too) close to 0.
[Figure: ballot2 - the polling stations where the prediction is farthest from the actual result]
Harsh pictures indeed: in these polling stations there is not much of a relationship between the prediction and the actual result... I would advise the election committee to go back and take a deeper look at the irregularities in these stations.
 

Who drank my votes?

One of the central topics of the 21st election campaign was the claim that the big parties 'drank' the mandates of the small parties. Can we use our model to substantiate this claim?

Suppose that in the 20th Knesset elections there had been a polling station where 100% of the votes went to one specific party; it is interesting to see what result the model would predict for that station in the 21st Knesset elections.
For example, if there had been a polling station where 100% of the votes in the 20th Knesset elections went to the Labor party (then the Zionist Union), then according to the model only 19% of the votes would have stayed with the 'Emet' list, 11% would have moved to Meretz, and 63% would have moved to the 'Pe' list (Kahol Lavan, i.e. Blue and White). Apparently there is something to it.

So if we have a model that knows how to predict how the voters of a given party changed their minds between the two elections, why not examine this for all the parties?
To do so, we create a square 26×26 matrix in which all values are 0 except for the diagonal, which is set to 1. This gives us a data structure of 26 simulated polling stations, in each of which a single party received 100% of the votes.
The matrix, then, looks like this:

[Figure: diagflat - the 26×26 diagonal matrix of simulated polling stations]

With this matrix we can examine the 'vote drinking' question directly, by predicting the 21st Knesset results for these simulated stations using the model built earlier.
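A sketch of this experiment: np.eye builds the 26 simulated polling stations, and feeding them through the trained model yields the predicted 'migration' of each party's voters (the printout logic is just for illustration):

```python
import numpy as np

simulated_ballots = np.eye(26)                 # 26 stations, each 100% for one party
migration = model.predict(simulated_ballots)   # shape (26, 43)

# e.g. where did the voters of party i go?
i = 0
for j in np.argsort(migration[i])[::-1][:5]:
    print('party %d -> party %d: %.0f%%' % (i, j, 100 * migration[i, j]))
```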
We can visualize this in the following map (reduced to the 13 most significant parties in the 21st Knesset):
[Figure: ballot-map - predicted migration of voters from 20th Knesset parties to 21st Knesset parties]

A few interesting anecdotes:

1. As mentioned, the Labor party retained the trust of 19% of its original voters; 65% migrated to Blue and White.
2. The title of most loyal voters goes to United Torah Judaism, with 90% loyalty.
3. 30% of the Jewish Home's voters evaporated with the New Right party (which, as we know, did not pass the electoral threshold).
4. Kulanu's voters scattered across many parties, among them Shas, Blue and White, the Likud, Ra'am-Balad and United Torah Judaism.
5. About 25% of Yisrael Beiteinu's voters voted for Hadash-Ta'al in the 21st Knesset. Very strange! I'd be happy to hear an explanation in the comments.
6. The Likud won the trust of 10% of Yesh Atid's voters, while Blue and White (which ran under the same ballot letters) failed to pull any significant number of mandates away from the Likud.

Clearly, this level of insight would be hard to reach with the classic methods used by regular Business Intelligence tools.

A vector fingerprint for parties

As mentioned in the opening, one of the less talked-about secrets of the DL world is the by-products of the prediction process - in the hidden layers, but also in the final layer. In our case, for example, the horizontal row of numbers for each party in the map above can be seen as that party's representation in vector space - I call it a party embedding.

Let's look at the map again, but focus for a moment on two parties: the Labor party and Blue and White.

[Figure: ballot-map, highlighting the rows of the Labor party and Blue and White]

It turns out that these two rows are quite close to each other - the angle from the origin (0, 0, 0, ..., 0) to the point described by each of the two rows is almost identical (if we normalized each row to 100% we would see this clearly, but not this time).
The logical meaning is that the voters of these two parties in the 21st Knesset came, by and large, from the same parties, and in similar proportions, in the 20th Knesset.

We can think of this horizontal vector as the party's fingerprint: parties that are similar to each other (in terms of voters, not necessarily ideology) receive a similar representation, i.e. the cosine similarity values for such a pair approach 1.

Now let's compute the distance between every pair of parties - these are the most prominent pairs:

[Figure: dist - the most similar party pairs by cosine similarity]
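A sketch of how these pairwise similarities could be computed from the predicted migration matrix above (treating each column, i.e. each 21st Knesset party, as an embedding vector):

```python
import numpy as np
from itertools import combinations

# One 26-dimensional vector per 21st Knesset party: a column of the migration matrix
party_embeddings = migration.T            # shape (43, 26)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

pairs = []
for i, j in combinations(range(party_embeddings.shape[0]), 2):
    pairs.append((cosine_similarity(party_embeddings[i], party_embeddings[j]), i, j))

# The most similar pairs first
for sim, i, j in sorted(pairs, reverse=True)[:10]:
    print('party %d ~ party %d: %.2f' % (i, j, sim))
```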

No more right or left?

Remember the slogan? Let's check...
We will try to draw the map of the 21st Knesset parties on a two-dimensional canvas (due to the limitations of the medium, of course) - that is, we need to convert the data from a vector of 26 cells to a vector of two cells without losing too much information.
There are two main ways to do this: a dimensionality reduction algorithm such as t-SNE, or a dimensionality transformation algorithm such as PCA.
I chose a different approach: a weighted graph, where the weight of each edge reflects the distance between a pair of parties.
We will draw the graph with the networkx library (I colored the edges white), which uses an iterative algorithm that moves the nodes until the system reaches mechanical equilibrium.
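A sketch of the drawing step with networkx. The similarities dict here is a hypothetical container for the pairwise values computed above; spring_layout is the iterative force-directed algorithm mentioned, and since it pulls heavier edges closer together, similarity rather than raw distance is used as the weight in this sketch:

```python
import networkx as nx
import matplotlib.pyplot as plt

# similarities: {('party_a', 'party_b'): cosine_sim, ...} from the embeddings above
G = nx.Graph()
for (a, b), sim in similarities.items():
    G.add_edge(a, b, weight=sim)

pos = nx.spring_layout(G, weight='weight', seed=42)   # iterative force-directed layout
nx.draw_networkx_nodes(G, pos, node_size=300)
nx.draw_networkx_labels(G, pos, font_size=8)
nx.draw_networkx_edges(G, pos, edge_color='white')    # edges drawn in white, as in the post
plt.axis('off')
plt.show()
```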

And we got the following two-dimensional map:

[Figure: votes-map - two-dimensional party map produced by networkx]

This graph, then, represents the distance between every pair of parties. It is interesting to see how the different parties arranged themselves in space: the parties identified with the left crowded together at the top, and the right-wing parties at the bottom.
How would you define the horizontal axis? Write in the comments..

Summary

Deep learning is more than just predictions, classifications and the like - the new tools open up new worlds of possibilities in adjacent areas, for example the world of Business Intelligence.
Intelligent use of the information that accumulates in the hidden layers or in the prediction outputs can open our eyes to very interesting secrets, if only we know how to extract them.
Will the analytics and BI companies reinvent themselves to offer new DL-based reporting services? I suspect not right away, but there certainly seems to be an interesting opportunity for innovation here.

The entire code can be found at the following link.

Happy Deep Learning!

Natural Language Understanding using Deep Learning

The field of Natural Language Processing (NLP), or under its more inspiring name - Natural Language Understanding (NLU), is quite old in the computer science world.

Since the 1950s, scientists have been looking for ways to automatically process human language. The algorithms developed up to the 80s were mostly based on sets of manual rules, and later ones on Machine Learning algorithms.

Throughout the years, significant achievements have been accomplished - parsing of sentences, entity extraction, part-of-speech tagging, topic modeling, classification and categorization of text, translation between languages, text-to-speech, speech-to-text, sentiment analysis, automatic summarization, and more..

It seems that analyzing Shakespeare's writing, with its perfect syntax, is a doable task.

Unfortunately, humans have started creating more complex texts, far messier than the English genius's prose - texts that the traditional NLP algorithms have failed to deal with.

Apparently, processing a truly natural language is not a trivial task at all. Existing NLP parsers are very fragile and often fail to process freely written texts that do not follow the rigid rules of a language.

Deep Learning to the rescue

Deep Learning is an old/new field in Computer Science, or to be more precise – it’s a rebranding of some very old algorithms from the family of Neural Networks.

At the core of this methodology there are multiple (hence deep) layers of neurons connected with varying weights that are calculated during the training phase of the net. I will not further detail the technical aspects of DL, as it is not the main topic of this post and there's a lot of content available online.

In the past few years this field has made tremendous progress thanks to some groundbreaking achievements in the fields of visual and voice recognition, for example:

  • Image recognition: at the end of 2014, Google published a post about the ability to describe an image and the elements inside it.
  • Voice recognition: Microsoft published a post at the end of 2016 claiming they had achieved human parity in conversational speech recognition.

Recently, Deep Learning is also mentioned frequently with respect to NLP, due to the great interest in analyzing and processing human language as it is actually written: free text, emojis, typos, syntax errors, acronyms and other Generation Y text. Add to that the buzz around chatbots, and you get a complete fuss.

What has changed?

From the data angle

It's a well-known fact that having data is a precondition for developing learning systems.

In 2010, a project named ImageNet started (led mostly by Stanford and Princeton researchers), in which over 10 million images were manually tagged. This dataset is now available to everyone, and its existence jump-started the research of Deep Learning algorithms, especially Convolutional Neural Networks, which are very well suited to image recognition problems.

Textual data for research has also become more accessible than ever before - SNAP from Stanford, labeled movie reviews and ratings from IMDB, the entire Wikipedia available for download, and more.

From the hardware angle

Neural network algorithms 'like' to run on graphics cards (GPUs). These cards are characterized by their ability to process massive numbers of small, simple calculations in parallel. Market benchmarks show 50x to 100x improvements in training time compared to CPUs. And so, companies such as Asus, AMD, Intel and NVIDIA have started developing graphics cards that are optimized for Deep Learning workloads (and not just for first-person-shooter games).

From the academic angle

The field of Neural Networks has known ups and downs over the past 70(!) years. In recent years, a tremendous amount of academic research has been conducted by top universities (led by Stanford) and by the research departments at Google, Microsoft and IBM.

Specifically in the NLP world, there has been significant progress on the subject of text and word representation. More on that later in this article…

And from the software angle

In my opinion, what placed Deep Learning at the center of the stage was Google's release, at the end of 2015, of Tensorflow - an open source library for developing Deep Learning networks, which became extremely popular with over 44K stars on GitHub.

* Correlation does not imply causation, so the explanation may well run in the opposite direction.

Automatic learning of features

Beyond the great fuss created around Deep Learning, this field is truly leading a change in the way learning systems are developed.

Classic learning systems usually require a pre-processing phase of the data, called feature extraction or feature engineering. In this phase, the researcher tries to find attributes (features) in the data whose presence, absence, or co-occurrence with other features might explain a trend in the prediction. In many cases these processes require good domain knowledge and a significant statistical background, and thus they tend to be very manual, Sisyphean and time consuming.

In NLP specifically, the researcher is expected to understand grammar, morphology, language formalization, pronunciation, syntax and so forth - for every language he's dealing with - in order to create the relevant features for the learning system.

For example:

  • The prefix 'un' at the beginning of a word changes the word's meaning - for example: uninterested.
  • Understanding sequences - for example: pun intended.
  • Relying on syntactically and semantically annotated texts, for example - treebank.
  • Relying on external lexical databases to identify part-of-speech elements, synonyms, antonyms, for example - wordnet.
  • Relying on external maps of names, locations, products, etc..

Eventually, even professional newspaper text can confuse a human being.

It would probably be an exaggeration to say that feature extraction processes will disappear thanks to DL algorithms, but a change of course is clearly visible. Instead of having the researcher extract features, we 'throw' the data at the net, and it finds the relevant features automatically and weights them correctly.

The change of course, then, is this: instead of manually extracting features, the main task is to represent the data correctly, so that the system can identify the features automatically.

Data representation

Classic NLP algorithms usually try to represent words/sentences/documents using a vector or a matrix of numbers. There are quite a few popular methods for doing so: one-hot, BOW, CBOW, TF-IDF, co-occurrence matrices, n-grams, skip-grams and more.

For example, in the simplest one-hot representation:

  • The word ‘plane’ might be represented by a vector: [0, 0, 1, 0, 0, 0, …. , 0, 0]
  • And the word ‘airplane’ might be represented by a vector: [0, 0, 0, 0, 1, 0, …. , 0, 0]

The size of the vector equals the number of distinct words in the text corpus. This type of representation creates two main problems:

  1. Sparsity - the number of dimensions (vector length) needed to represent a single word is the total number of words in the corpus, which can easily reach tens of thousands or more. Clearly this representation is inefficient and requires significant computational resources to feed into a learning system.
  2. Term relationships - the word 'plane' and the word 'airplane' are synonymous and interchangeable. These representation methods miss this information, which is crucial for understanding the text well.

What's required is a way to represent text efficiently, in a low-dimensional vector space.

Word representation (word embedding)

In 2013, researchers from Google published a paper describing how to represent words in a vector space - a paper that deeply influenced the NLP world and the use of neural nets for this purpose. Google also published the code behind the paper under the name Word2Vec (based on a neural network, of course). The algorithm takes a large corpus of text and creates, for each word, a vector representation of a chosen size (usually 50-200 dimensions).

What's really interesting about the output of this algorithm is that distances and angles between words carry linear meaning.

In this example (in a two-dimensional space), the words 'plane' and 'airplane' are very close to each other. Additionally, the distance and angle between 'plane' and 'sky' are very similar to the distance and angle between 'car' and 'ground'.

Another fascinating aspect of this algorithm is that you can run it on any text, in any language, without manually crafting features in advance, and you will still get a data structure with these linear characteristics.
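A minimal example of training and querying such vectors with Gensim's implementation of word2vec (the toy corpus here is just a stand-in for any large tokenized text):

```python
from gensim.models import Word2Vec

# sentences: a list of tokenized sentences, e.g. loaded from any large corpus
sentences = [
    ['the', 'plane', 'flew', 'through', 'the', 'sky'],
    ['the', 'car', 'drove', 'on', 'the', 'ground'],
    # ... many more sentences
]

# 'size' is the embedding dimensionality (renamed to 'vector_size' in Gensim 4+)
model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)
print(model.wv.most_similar('plane', topn=5))
```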

In 2015, Instagram published a post on their engineering blog describing interesting research into emoji usage with different tools, including word2vec.

It’s amazing to see which words are closest to each emoji:

😂 ⇒ lolol, lmao, lololol, lolz, lmfao, lmaoo, lolololol, lol, ahahah, ahahha, loll, ahaha, ahah, lmfaoo, ahha, lmaooo, lolll, lollll, ahahaha, ahhaha, lml, lmfaooo

😍 ⇒ beautifull, gawgeous, gorgeous, perfff, georgous, gorgous, hottt, goregous, cuteeee, beautifullll, georgeous, baeeeee, hotttt, babeee, sexyyyy, perffff, hawttt

Many applications in the field of NLU now use low-dimensional word representations produced by word2vec (or GloVe, which works in a statistical manner but produces a very similar outcome) as the input to a Deep Learning network - a process called pre-training.
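A sketch of that pre-training step in Keras: load pre-computed vectors into an Embedding layer and freeze it (the sizes and the randomly filled matrix below are placeholders for real word2vec/GloVe vectors):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

vocab_size, embed_dim, max_sentence_len = 10000, 100, 50   # illustrative sizes
# embedding_matrix would normally be filled from the word2vec / GloVe vectors;
# here random numbers stand in for the pre-trained weights.
embedding_matrix = np.random.rand(vocab_size, embed_dim)

model = Sequential()
model.add(Embedding(input_dim=vocab_size,
                    output_dim=embed_dim,
                    weights=[embedding_matrix],   # inject the pre-trained vectors
                    input_length=max_sentence_len,
                    trainable=False))             # keep the pre-trained vectors frozen
# ...the rest of the network (e.g. an LSTM and a Dense classifier) goes on top
```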

Understanding sequences – Recurrent Neural Network

In order to predict a real-estate property's value, you can refer to its location, size, year built, etc. As a matter of fact, the order of the features doesn't matter much - only their existence.

In text, on the other hand, words do not stand by themselves. A word's meaning can change according to the words that come before or after it.

Traditional learning systems can't natively handle sequences of features that depend on time and order. Thus, already in the 80s, a class of algorithms from the neural network family was developed to address this limitation: Recurrent Neural Networks.

RNNs are very similar to regular NNs, with one major difference: the output of every layer is also fed back as input to the same layer at the next step. This feedback-loop architecture enables the net to 'remember' the information from the previous step (which has in fact accumulated up to that point), and thus to represent sequences.

Apparently, this architecture works really well. In the past few years researchers have managed to solve problems in voice recognition, translation, sentiment analysis and more using variations of RNNs (mostly LSTM and GRU). RNNs are widely considered the best way to model human language, and most recent research uses them.
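For example, a minimal sentiment classifier in Keras that stacks an LSTM on top of an embedding layer (the layer sizes are arbitrary choices for illustration):

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=20000, output_dim=128))  # vocabulary of 20k words
model.add(LSTM(64))                                    # reads the sequence word by word
model.add(Dense(1, activation='sigmoid'))              # positive / negative sentiment

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(padded_sequences, labels, epochs=3, batch_size=64)  # hypothetical training data
```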

The future – ask me anything

Is Deep Learning the solution to all NLP problems? It certainly seems like this is where things are heading. Using smart word embedding techniques together with variations of RNNs, researchers have outperformed almost every other classic algorithm on the NLP problems tested.

Is this the end of the story? It seems like it’s only the beginning.

At this stage, the ability of learning systems amounts to crunching data and using it to do one very specific task (e.g. predicting the cost of a property). Can we build Artificial Intelligence models that can answer any question? It is still too early to say.

The ultimate goal in the field of AI is called Artificial General Intelligence: an intelligence that can handle tasks at the level of a human being, including judgment, intuition, logic, self-awareness, the ability to communicate, the ability to learn, and more.

It will probably take quite some time to accomplish complete AGI, but recent papers and articles sketch the roadmap to get there:

  • In February 2016, researchers from Facebook published a paper about the roadmap to developing smart learning systems. In it they referred to two main capabilities: the ability to communicate and the ability to learn. The assumption is that the entirety of human knowledge has been digitized and is easily available to all; all we need is a machine that can read it and learn from it.
  • Researchers from Salesforce published a paper in March 2016 presenting a Dynamic Memory Network model able to deal with free-form questions and answers during a conversational dialog.

It is hard to predict the exact future, but it is safe to say that the pace of development is crazy. The time it takes from an academic paper being published until it becomes open-source code on GitHub is just ridiculous. It seems that significant developments will appear in the coming years. Hold tight.

Wrapping up – I’d like to learn more

The field of Deep Learning creates great opportunities for developers who do not come directly from a Data Science background, for a few reasons:

  • It's a new technology, a new paradigm - ramping up is required of everyone, whether you come from a nearby or a distant field. The pace of development is so rapid that it is sometimes more important to keep up with the trends and changes than to focus on one specific algorithm or another.
  • The reduced need for feature engineering.
  • The development of wrappers (e.g. keras.io) that simplify development over the raw libraries (e.g. tensorflow).
  • Support for many programming languages - Python, Java, Lua, and even some interesting libraries in JavaScript.
  • And specifically for NLP - it is very easy to get text to train on.

These two links [12] contain a curated list of items related to NLP and DL.

If I had to recommend software libraries for developing NLP + DL applications, these would be the two:

  • Gensim - especially for its easy-to-use word2vec implementation (Python), but also for other topic modeling algorithms - LDA, LSI, etc..
  • Keras.io - probably the easiest-to-use DL library (Python), running on top of Tensorflow or Theano. Quite recently, in January this year, Google announced that it plans to make Keras the default API for Tensorflow.

Happy Deep Learning !

AI will get you fired

The human brain is a pretty impressive organ - about 100 billion neurons and 150 trillion synapses make it an amazing machine that can think, create art and discover science.. it even put a man on the moon (if you believe that).

But it has its limits. Just try to multiply two 4-digit numbers, and it will take you 'ages' compared to a simple calculator. Do the same with 10-digit numbers and you're most likely to give up quickly, saying - I could do that, but why bother; a computer can do it much faster and more accurately.

Artificial Intelligence is changing the rules

That last statement is now being reinforced by recent research in AI and deep learning. Simply put -

If your job is to take a decision between a finite number of options, a machine is more likely to do it better than you, sooner than you’d expect.

This rule applies to many of the tasks / occupations we have today –

  • Cab drivers (left, right, gas, brake)
  • Brokers (hold, buy, sell)
  • Campaign managers (audience, creative, budget)
  • Medical Doctors (diagnosis, treatment, prognosis)

What?? Medical doctors? They have tons of experience…

Experience can be trained

Think of training experience using the following example - given a set of symptoms, clinical findings and background, the MD's job is to determine the diagnosis, treatment and prognosis. During their many years of study, young doctors train their brains to be 'wired' correctly so as to find the right match between the two vectors - the input and the output. They also learn the 'penalty' of a wrong treatment and how to avoid it.

The more experience they get => the more incidents they see => the more accurate their conclusions are.

Now, what if we apply the same logic on machines?

Artificial Neural Network

ANNs are nothing new; since the late 40s, researchers have been intrigued by the human brain and how to artificially mimic its behaviour. While research stalled for a few decades, the recent interest in deep learning and the increase in parallel processing capabilities have brought this discipline back into the spotlight.

Quite similar to the above example, ANNs are made of 3 kinds of layers - an input layer, an output layer and (multiple, hence deep) hidden layer(s).

Without getting too technical, training the network is a process whose goal is to find the optimal weights between the neurons in the different layers. The more examples you 'feed' the network (both correct and incorrect ones), the better it gets at finding those optimal weights and the more accurate the output eventually is.

Since the number of cases a machine can 'eat' is practically unlimited, it has an unfair advantage over the human brain, and so its accuracy will eventually gain the upper hand.

The sad thing about ANNs is that one cannot easily 'reverse engineer' them. It is not always trivial to identify the relationship between input and output simply by following the thickest path in the network, especially in complex nets. Think of an ANN as a black box that can solve complex problems based on past cases, but cannot necessarily be used to explain the reasoning behind them.

AI is the Tractor of White-Collar Jobs

Progress is being made ridiculously fast, as new startups rise daily, each trying to take a bite out of one of the many routine jobs in the market.

While Artificial Intelligence will probably eliminate jobs, it will also create new ones and will make many processes far more effective and cheaper.

Future jobs will be different, for sure. Decisions will be processed automatically, just as no one today bothers multiplying two 10-digit numbers by hand. The human role will not be about making decisions, but more about setting up the gameplay - what the possible inputs are and what the possible outputs are.

Conclusion

With enough data, machines will consistently beat humans at making the right decisions, and the effect on the employment market is going to be dramatic.

Nevertheless, decision-oriented professions are not going anywhere yet; we will still need them for the foreseeable future. But in the era of Artificial Intelligence, regardless of how much experience individuals possess, their jobs will change.

MDs, for example, will then need to 'help' the system make the right decisions by feeding in the non-quantitative measurements (e.g. "the patient is feeling chest pain and it makes him very stressed") and by setting up the potential outputs - the different types of treatments and diagnoses.

What's left for us? Setting up the gameplay is still on human shoulders, and new professions will rise around this need.

It’s going to be fascinating, hold tight!!

Being a Busy “B” won’t get you promoted

Do you know Michael who’s working with you?
He's the guy who usually eats his lunch while running from one meeting room to another, the guy who's always last to leave the office, the guy who won an iPad at the company's yearly conference for his "great achievements".

Michael is a Busy Bee, or better said – a Busy “B”.

There are 3 types of employees:
“A”s – those who set the rules
“B”s – those who follow the rules
“C”s – those who don’t get the rules

Michael is the perfect employee – he gets things done, managers love him, and when the shit hits the fan, he’ll be there to save the day.

Generally speaking, Michael likes his job. It “challenges” him, he thinks… 
Yes, like any other job, his involves many boring day-to-day tasks, and he really wishes to take a leadership position sometime in the future (although he's not entirely sure what that really means).
He feels he's making good progress in his career - he's learning new stuff and he even attended a couple of meetings with the company's CEO over the past year. And yet, he still worries he lacks enough 'field' experience to go the extra mile.

Michael’s career is stuck. 
It's stuck because he has the perfect skills to walk the path, but unfortunately he can't draw the path.
He'll probably end up moving horizontally - replacing Stephanie, who's going on maternity leave soon - but it will be the third time he's moved in this direction.

How will he break the glass ceiling?
He thought about raising it with his manager, but was too shy, and the 15-minute weekly meeting with her ended up covering only the burning issues, as always.
He’s full of ideas, but his manager never takes them seriously, stating that he should focus on his current job first.

In order to take the extra step, Michael will need to OWN the leadership and not to wait for someone to hand it over to him.

While employees struggle to finish their micro-tasks, companies are actually longing for true leaders who will look at the big picture and set their course.

If you’ve got the passion, you’ve got the energy to make a difference – don’t wait for someone to hand you the time to do it. Just do it.

Don’t ask permission, ask forgiveness. 
People don't get fired for taking the lead and failing, and if you do - remember, it wasn't a company worth working for in the first place.

Big Data 10th anniversary – where do we go now ?

When Google first introduced us to its internal secret sauce for massive data processing, back in December 2004, the term 'Big Data' did not yet carry any of the fuss we've been witnessing since.

As a matter of fact, the term had been used several times before - but as far as I'm concerned, Google's paper was the moment in time when the data era started. That was 10 years ago.

A few months later, Doug Cutting made history by developing Hadoop.
Heavily influenced by Google's paper (and named after his son's favorite toy), Hadoop rapidly changed the technology ecosystem, and amazing new tools leveraging the same concepts started popping up like mushrooms after the rain.

Happy birthday Big Data, you’re 10 years old !
Oh, sweet child; Where do we go now?

I believe that the coming 10 years are going to be just as interesting for Big Data:
New technologies relying on Hadoop as infrastructure, the rise of data science and the growing ease of data analytics, sophisticated computer vision, trends like the Internet of Things and wearable devices - all these and more are going to take us in new, interesting, sometimes unpredictable directions.

Nonetheless, I've decided to take a bold step and try to predict what the coming 10 years of big data are going to look like.
In the coming posts, I'm going to lay out my forecast for the future of big data, 10 years from now, focusing on 3 categories: technology; data analysis and science; and data products and privacy. This post focuses on technology.

Big Data Technology – Where do we go now ?

First things first, let's see how technology is going to evolve in the coming 10 years.
I believe that data platforms rest on 5 pillars: Data Repositories, Data Transformation, Data Retrieval, Data Visualization and Data Science.
These are my 6 predictions for the coming 10 years, related to the pillars above.

1. Data Repositories: Hadoop will become THE platform for any data driven technologies.

Hadoop as a platform has come quite a way since it was first developed as a distributed file system and a MapReduce enabler.

10 years on, we are now witnessing many new technologies built on top of the Hadoop Distributed File System (HDFS), leveraging the system's parallelism, high availability and robustness.
It's no longer 'just' a batch-oriented system - you can now run interactive queries using Impala and Spark, build search engines using Solr and Elasticsearch, and run real-time event processing using Spark Streaming, all governed by YARN.

But we’re kind of stuck.

At its very core, HDFS is a distributed file system; as a matter of fact, it's a pretty lame one. It only supports immutable objects, can't handle many small files very well, is quite slow on direct access, and has other limitations.

These limitations are now holding Hadoop back from its true destiny - becoming the infrastructure for any data-driven technology riding on its scalability and popularity.

That has to change at the very core of the architecture: HDFS needs to get all the basic features of any other respectable filesystem. MapR has already made significant progress in this direction.
Once HDFS becomes a true filesystem, with features equal to the ext* family and NTFS, we'll see true adoption by data infrastructures that really utilize its potential.
At first we'll see the 'new age' data infrastructures migrating onto HDFS - Cassandra, Elasticsearch, MongoDB and others - but then the giants will have to follow as well:

  • Do you need a filesystem to host your millions of images? Use Hadoop.
  • Oracle running on HDFS? You bet! It's just a matter of time.
  • Looking for a storage solution with easy search capabilities? Try Lucene on HDFS.
  • Real-time processing of data? Spark Streaming.

2. Data Transformation: ETL = CEP = Spark

ETL stands for Extract, Transform and Load, or in plain English - pull the data, change it, and stick it somewhere else. Hadoop's MapReduce makes the ultimate ETL infrastructure, especially for the 'T' part - it lets you easily manipulate any amount of data in a super effective way.

CEP stands for Complex Event Processing, or in plain English - get data in real time, manipulate it (usually by joining it with other data sources and/or aggregating it over a window of time) and take an action on it or stick it somewhere else. Real-time data manipulation is today's promise, using tools like Storm, Spark Streaming or any of the other commercial CEP solutions.

At the end of the day, both concepts are very similar. The main difference is that ETLs are usually considered to be batch processing jobs while CEPs are more real time creatures, but at their core, they do the same job – manipulate data and move it forward.

When you get to develop an ETL+CEP environment, you often ask yourself how to tie these tasks together - I want real-time signals and also 3 years of history, based on the same logic.

Nathan Marz, the 'father' of Storm, addresses this question with his proposed Lambda architecture, designed to answer any data question at any freshness between 100 ms and 100 years. I highly recommend reading his writings/books, but if you haven't had the chance yet - in a nutshell, he suggests streaming the data both through a real-time aggregation process (CEP) and through a batch aggregation process (ETL), and joining the two at query time in order to get accurate data with no latency and no limits. While this is a nice and elegant solution, the concept has an inherent flaw - if you develop two data manipulation processes, you'll end up with two unsynchronized results.

What's the solution?

Enter the new kid on the block - apart from its main appeal as an MR replacement and a fast data processing engine, Spark is also a unified mechanism for developing any kind of data manipulation job, both batch and real-time oriented.

In the coming years we'll be witnessing Spark getting a larger and larger footprint, and in 10 years (together with 3rd-party tools leveraging its code) it will dominate the data transformation world for both batch and real-time jobs.

3. Data Retrieval #1: The return of SQL - a Standard Structured Query Language

NoSQL is a pretty successful paradigm - the tech companies behind Mongo, Cassandra, Couchbase and others have gained huge traction and success over the past 5 years. There are two main reasons for that:

  1. SQL databases were not ready for the data boom - even today, scaling a MySQL cluster is not a trivial task (though possible). Oracle and SQL Server are still behind.
  2. Modeling data in a two-dimensional structure feels unnatural (good luck modeling a user's multiple purchases, with multiple products in each purchase, in a SQL database).

Nevertheless, having an easy way to query data is crucial - you should not have to be a developer to run queries.
Almost every one of the NoSQL companies has created a proprietary query language, often strangely resembling SQL. But as of today, there's no single standard for querying NoSQL DBs.

Standards are crucial for the economy!
From the customers' point of view, they make it easy to replace one piece of technology with another. As long as both parts 'speak' the same language, the transition should be nearly seamless (unfortunately, it never actually works like that).
From the vendors' side, they let them develop solutions that scale easily, since they don't require expensive integration projects.

We're lacking a Standard Query Language (SQL?) that will be adopted by everyone, including the NoSQL players, to get a real movement going. This language, which should probably be based on extensions of good old SQL, should support 2D tables as well as complex documents, including records, arrays and maps. Surprisingly enough, the leading technology in the current SQL world that already does this is PostgreSQL.

4. Data Retrieval #2: Big Data Analytics – select anything from anywhere where whatever

In the old days (5 years ago), when a manager wanted a new report, a request was fired off to the R&D department to develop the new capability.
R&D had to take into consideration the report UI, data modeling, ETLs, aggregations/cubes, scheduling, backups and so forth. It could take a few months until that request was satisfied.

The main reason was that queries in traditional databases were just not fast enough - querying even a few million records in Oracle can take minutes, which is pretty lame compared to the slick experience you're used to when running Google Analytics.
In order to increase query performance, you had to pre-calculate the results and serve them crunched and ready. If the manager then asked to filter by state, another 3 months could pass, because the aggregation that had been created did not maintain that hierarchy.

Prof. Michael Stonebraker, the mind behind C-Store and Vertica, revealed to the world the simple fact that some data formats are meant for writing (row oriented) while others are meant for reading (column oriented). On this basis he founded Vertica, which was later acquired by HP.
Thanks to Mr. Stonebraker (conception-breaker), we now understand that it is possible to query anything with any filter without pre-calculating in advance.

While this works very well for technologies such as Vertica, Greenplum, Amazon Redshift, Sybase IQ and some others, we're still lacking a serious breakthrough in the Hadoop ecosystem.
True, the well-publicized war between Impala, Hive and Spark, and between the file formats Parquet and ORC, looks very interesting, and one side will eventually gain the upper hand. But it is clear that both are 3-4 years behind the non-Hadoop vendors.

5. Data Visualization: beyond bars and gauges – visualization of documents

It is really funny how the BI vendors of the world have built their tools driven mostly by the limitations of the underlying two-dimensional repositories, and not by how humans actually think. It's even funnier that we've gotten used to it.
Every kid can take a table in Excel and make different charts out of it, but how would you visualize a JSON file?

The NoSQL movement caught the BI players off guard; they were just not built to visualize complex objects, only tables. It looks like they were simply hoping it would go away somehow.

But the world had a different plan – it needed a solution.

3 guys picked up the gauntlet and developed D3.js - a data-driven visualization library aiming to go beyond the simple art of visualizing 2D tables. Using d3 (and other similar libraries) you can now chart graphs, treemaps, word clouds, chord diagrams and more. Really amazing stuff.

Since then, many of the BI players have also started to adopt these concepts, embedding them into their tools as well.
While today data visualization is 99% driven by 2D tables and 1% by documents, I think that in 10 years it will be 50-50.

6. Data Science: Commoditizing data science

Today, when a developer wants to build a market basket analysis, a real-time prediction system, or anomaly detection over financial data, he needs to be able to discuss fluently the entropy of continuous variables, binning of non-discrete data, standard errors of proportions and the R-squared distance from a line.

Data science is a sexy job, but it's pretty darn complicated. Sex shouldn't be complicated.
The data science problem has been too long in the hands of mathematicians and too little in the hands of developers.

I know data, but I’m not really sure how the electricity to my house really works. It just works. Black boxing is the ultimate answer to complexity.

I think that in 10 years data science will be the BI of today - you've gotta have it in your stack, and all you've got to do is hook up the wires correctly, and it works like a black-box charm.

TL;DR

It has been 10 years of Big Data since Google's paper about MapReduce back in Dec 2004. These are my 6 predictions for the coming 10 years:

  • Hadoop will become THE platform for any data-driven technology - Oracle will run on it.
  • ETLs and CEP will be done together, with Spark.
  • SQL will support document data and will be adopted by the NoSQL DBs.
  • Analytics will run any query, on any filter, without pre-calculated aggregations.
  • We'll be charting a lot more complex documents and fewer 2D tables.
  • Developers will be able to run predictive models without a Ph.D. in Mathematics.

Next posts will focus on data analysis/science and data products.
