{"id":985,"date":"2019-07-21T14:37:13","date_gmt":"2019-07-21T12:37:13","guid":{"rendered":"https:\/\/blog.zhaw.ch\/datascience\/?p=985"},"modified":"2019-07-21T14:38:51","modified_gmt":"2019-07-21T12:38:51","slug":"twist-bytes-vardial-2018","status":"publish","type":"post","link":"https:\/\/blog.zhaw.ch\/datascience\/twist-bytes-vardial-2018\/","title":{"rendered":"Twist Bytes @Vardial 2018"},"content":{"rendered":"\n<p> by <a href=\"https:\/\/www.spinningbytes.com\/author\/benf\/\">Fernando Benite<\/a><a href=\"https:\/\/www.zhaw.ch\/en\/about-us\/person\/benf\/\">s <\/a>(ZHAW and <a href=\"https:\/\/www.spinningbytes.com\/\">SpinningBytes<\/a>)<br><\/p>\n\n\n\n<p><em>cross-posted from the SpinningBytes <a href=\"https:\/\/www.spinningbytes.com\/twist-bytes-vardial-2018\/\">blog<\/a><\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">schwiiz ja*<\/h2>\n\n\n\n<p>This year, the SpinningBytes team participated in the VarDial competition, where we achieved second place in the German Dialect Identification shared task. The task\u2019s goal was to identify, which region the speaker of a given sentence is from, based on the dialect he or she speaks. Dialect identification is an important NLP task; for instance, it can be used for automatic processing in a speech-to-text context, where identifying dialects enables to load a specialized model. In this blog post, we do a step by step walkthrough how to create the model in Python, while comparing it to previous years\u2019 approaches.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>In the data, which was provided for this task, we are given a training set, consisting of around 17.5 k individual sentences which do not contain any context or additional information. Each sentence in the training data has a label belonging to four different cantons, which is a rough approximation for the dialect, especially since they are close to each other, spanning a region of 80 km times 80 km. 
The Swiss German sentence in the title of this article (* which means \u201cswiss yes\u201d in English) was taken from the corpus and is assigned to the Bernese dialect. Other interesting sentences include: \u201cech was\u201d (approximate English translation: \u201creally something\u201d, dialect of Lucerne) and \u201cim diskutiere han i immer gern r\u00e4cht gha\u201d (approximate English translation: \u201cin discussions, I like to be right\u201d, dialect of Zurich). Another concrete application motivating this task is to better differentiate between languages, especially in problems such as identifying the language of a tweet, since tweets are often colloquial or transcribed spoken language. In one example involving a popular language identification program, a tweet most likely written in African-American English (clearly EN-US) was identified as Danish, which shows the importance of solving this task (see the following paper for more information: <a href=\"https:\/\/arxiv.org\/abs\/1707.00061\">https:\/\/arxiv.org\/pdf\/1707.00061<\/a>).<\/p>\n\n\n\n<p>The data from this year is basically the same as last year\u2019s, with the difference that we now have a development set, which is last year\u2019s test data. We based our approach on the winning architecture from last year\u2019s VarDial, but extended it in several ways. The results from last year were as follows:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"http:\/\/www.spinningbytes.com\/wp-content\/uploads\/2018\/06\/table2017.png\"><img decoding=\"async\" src=\"http:\/\/www.spinningbytes.com\/wp-content\/uploads\/2018\/06\/table2017-300x253.png\" alt=\"\" class=\"wp-image-3594\" \/><\/a><\/figure><\/div>\n\n\n\n<p><em>Results from VarDial GDI 2017 (for comparison)<\/em><\/p>\n\n\n\n<p>Below we show step by step how to increase the performance on this dataset, starting from scratch. 
Also, while building the system, we compare its performance with the results above. The code is available at <a href=\"https:\/\/github.com\/spinningbytes\/Vardial_blog\">https:\/\/github.com\/spinningbytes\/Vardial_blog<\/a>. One important thing to note: last year, weighted F-1 was used as the evaluation measure; this year, it is macro F-1.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Hands On<\/h2>\n\n\n\n<p>You might now want to clone the GitHub repository <a href=\"https:\/\/github.com\/spinningbytes\/Vardial_blog\">https:\/\/github.com\/spinningbytes\/Vardial_blog<\/a>. You can run baseline.ipynb (and the notebooks building on it) in a Jupyter Python session. Below, we reproduce the script.<\/p>\n\n\n\n<p>We first download the task data into a local data directory.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">#based on https:\/\/github.com\/ynop\/audiomate\/blob\/master\/audiomate\/utils\/download.py\n\n\nimport zipfile\nimport requests\nfrom collections import Counter\n\n\ndef download_file(url, target_path=None):\n    \"\"\"\n    Download the file from the given `url` and store it at `target_path`.\n    \"\"\"\n    if target_path is None:\n        target_path = url.split(\"\/\")[-1]\n\n    r = requests.get(url, stream=True)\n\n    with open(target_path, 'wb') as f:\n        for chunk in r.iter_content(chunk_size=1024):\n            if chunk:\n                f.write(chunk)\n    return target_path\n\ndef extract_zip(zip_path, target_folder):\n    \"\"\"\n    Extract the content of the zip-file at `zip_path` into `target_folder`.\n    \"\"\"\n    with zipfile.ZipFile(zip_path) as archive:\n        archive.extractall(target_folder)\n\n\n\ndef get_data(fname=None, all=1):\n    #`all=0` drops very short sentences (explained below)\n    if fname is None:\n        fname = \".\/data\/train.txt\"\n    texts = []\n    labels = []\n\n    #split the lines into text and labels and save them separately\n    with open(fname, encoding=\"utf-8\") as f1:\n        for l1 in f1:\n    
        ws, lab = l1.split(\"\\t\")  #lines are already str in Python 3, no decode needed\n            texts.append(ws)\n            labels.append(lab.strip())\n\n    #character vocabulary\n\n    tokensents = [tk.lower().split() for tk in texts]\n    words = set([word for tokens in tokensents for word in tokens])\n\n    chars = list(' '.join(words))\n\n    char_counts = Counter(chars)\n    labels_dict = {}\n    labels_nr = []\n    nums = set()\n    for i1,lab in enumerate(labels):\n        #this step will be explained later: sentences with at most 2 tokens\n        #should be ignored\n        if all==0 and len(tokensents[i1])&lt;=2:\n            continue\n        nums.add(i1)\n        if lab in labels_dict:\n            labels_nr.append(labels_dict[lab])\n        else:\n            labels_dict[lab] = len(labels_dict)\n            labels_nr.append(labels_dict[lab])\n    tokensents = [tk for i1,tk in enumerate(tokensents) if i1 in nums]\n    return labels, labels_nr, labels_dict, tokensents, words, chars, char_counts\n\n#training data\nfname=\"https:\/\/scholar.harvard.edu\/files\/malmasi\/files\/vardial2018-gdi-training.zip\"\ntarget_path = download_file(fname)\nextract_zip( target_path, \".\/data\/\")\n\nlabels_train, labels_nr_train, labels_dict_train,sents_train_raw, words_train, chars_train, char_counts_train = get_data()\nlabels_dev_dev, labels_nr_dev, labels_dict_dev,sents_dev_raw, words_dev, chars_dev, char_counts_dev = get_data(\".\/data\/dev.txt\")\n\nsents_train= [\" \".join(tk).lower() for tk in sents_train_raw]\nsents_dev= [\" \".join(tk).lower() for tk in sents_dev_raw]<\/pre>\n\n\n\n<p>Let\u2019s try to classify it:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">from sklearn.metrics import f1_score\n\nfrom sklearn.svm import LinearSVC\n\nfrom sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer\n\nn_features=20000\n\nclf_svm = LinearSVC(random_state=0,C=1)\ntf_vectorizer = CountVectorizer( min_df=2,\n        
                        max_features=n_features)\n\n\ntf_train = tf_vectorizer.fit_transform(sents_train)\n\ntf_dev = tf_vectorizer.transform(sents_dev)\n\nclf_svm.fit(tf_train,labels_nr_train)\n\nprint(\"SVM TF weighted\",f1_score(clf_svm.predict(tf_dev),labels_nr_dev, average=\"weighted\"))\nprint(\"SVM TF macro\",f1_score(clf_svm.predict(tf_dev),labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">SVM TF weighted 0.6100404138001064\nSVM TF macro 0.6036412574311785\n<\/pre>\n\n\n\n<p>This already seems pretty good compared to last year\u2019s results, where the best score was 0.66 (weighted F-1 score, <a href=\"https:\/\/www.aclweb.org\/anthology\/W\/W17\/W17-1201.pdf\">https:\/\/www.aclweb.org\/anthology\/W\/W17\/W17-1201.pdf<\/a> ). So far, this would put us in 9th place. There are still 8 other approaches ahead of us, accounting for the 0.05-point gap between our score and the best one, which we intend to close. Let\u2019s see how.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">It is all about preprocessing<\/h2>\n\n\n\n<p>But we can do much better using just a simple trick: we skip the one- and two-word sentences, such as \u201cjo jo\u201d, because they are bad for training the classifier. Why are they bad? Mostly for two reasons: a) they have too few features (only one or two words), so they are difficult to differentiate and TF or TF-IDF gets somewhat confused; and b) they have very confusing labels, i.e. the same sentence can carry two different labels, e.g. the \u201cjo jo\u201d sentence is labelled BS in line 15 of the training data and LU in line 2921. 
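Such label conflicts are easy to enumerate with a small helper. This is a hypothetical snippet, not part of the original repository; it assumes parallel lists of sentences and labels as loaded by get_data, and is shown here on a toy corpus:

```python
from collections import defaultdict

def find_label_conflicts(texts, labels):
    """Map each (lower-cased) sentence to the sorted list of labels it appears with,
    keeping only sentences annotated with more than one canton."""
    seen = defaultdict(set)
    for text, label in zip(texts, labels):
        seen[text.lower()].add(label)
    return {t: sorted(ls) for t, ls in seen.items() if len(ls) > 1}

# toy example mimicking the "jo jo" conflict in the training data
texts = ["jo jo", "ech was", "jo jo", "gaht guet"]
labels = ["BS", "LU", "LU", "ZH"]
print(find_label_conflicts(texts, labels))  # {'jo jo': ['BS', 'LU']}
```

Running this on the real training data would reveal the \u201cjo jo\u201d case described above.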
Let\u2019s see how well this works.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">labels_train, labels_nr_train, labels_dict_train,sents_train_raw, words_train, chars_train, char_counts_train = get_data(all=0)\nlabels_dev_dev, labels_nr_dev, labels_dict_dev,sents_dev_raw, words_dev, chars_dev, char_counts_dev = get_data(\".\/data\/dev.txt\",all=0)\n\nsents_train= [\" \".join(tk).lower() for tk in sents_train_raw]\nsents_dev= [\" \".join(tk).lower() for tk in sents_dev_raw]\n\ntf_train = tf_vectorizer.fit_transform(sents_train)\n\ntf_dev = tf_vectorizer.transform(sents_dev)\n\nclf_svm.fit(tf_train,labels_nr_train)\n\nprint(\"SVM TF weighted\",f1_score(clf_svm.predict(tf_dev),labels_nr_dev, average=\"weighted\"))\nprint(\"SVM TF macro\",f1_score(clf_svm.predict(tf_dev),labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">SVM TF weighted 0.6438060442167858\nSVM TF macro 0.6366767077208305\n<\/pre>\n\n\n\n<p>Wow, 3 additional points, which already puts us in fourth place compared to last year. Surprisingly, we are still just counting the words appearing in the sentences. Normally, SVMs work best with normalized data, but let\u2019s leave that for another tutorial.<br> Let\u2019s see if there is a weighting scheme better than term frequency. 
Normally, Term Frequency-Inverse Document Frequency (TF-IDF) works better for document classification.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">tfidf_vectorizer = TfidfVectorizer( min_df=2,\n                                 norm=\"l2\",\n                                   max_features=n_features)\n\ntfidf_train = tfidf_vectorizer.fit_transform(sents_train)\ntfidf_dev = tfidf_vectorizer.transform(sents_dev)\n\nclf_svm.fit(tfidf_train,labels_nr_train)\n\n\nprint(\"SVM TF-IDF weighted\",f1_score(clf_svm.predict(tfidf_dev),labels_nr_dev, average=\"weighted\"))\n\nprint(\"SVM TF-IDF macro\",f1_score(clf_svm.predict(tfidf_dev),labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">SVM TF-IDF weighted 0.659225631519502\nSVM TF-IDF macro 0.6514922794527852\n<\/pre>\n\n\n\n<p>This would give us third place, almost second. But this is with an SVM \u2013 do other classifiers perform well?<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">from sklearn.ensemble import RandomForestClassifier\nfrom sklearn.ensemble import BaggingClassifier\nimport numpy as np\nimport scipy.sparse\n\nclf = RandomForestClassifier(max_depth=5, n_estimators=1000, random_state=0)\n\nclf.fit(tf_train,labels_nr_train)\n\n\nprint(\"RF TF weighted\",f1_score(clf.predict(tf_dev),labels_nr_dev, average=\"weighted\"))\nprint(\"RF TF macro\",f1_score(clf.predict(tf_dev),labels_nr_dev, average=\"macro\"))\n\n\nclf.fit(tfidf_train,labels_nr_train)\n\nprint(\"RF TF-IDF weighted\",f1_score(clf.predict(tfidf_dev),labels_nr_dev, average=\"weighted\"))\nprint(\"RF TF-IDF macro\",f1_score(clf.predict(tfidf_dev),labels_nr_dev, average=\"macro\"))\n\n\nbagging = BaggingClassifier( RandomForestClassifier(max_depth=5, n_estimators=1000, random_state=0),\n                             max_samples=0.5, max_features=0.5)\nbagging.fit(tf_train,labels_nr_train)\n\n\nprint(\"Bagging TF weighted\",f1_score(bagging.predict(tf_dev),labels_nr_dev, average=\"weighted\"))\nprint(\"Bagging TF 
macro\",f1_score(bagging.predict(tf_dev),labels_nr_dev, average=\"macro\"))\nfrom sklearn.ensemble import BaggingClassifier\nbagging = BaggingClassifier( RandomForestClassifier(max_depth=5, n_estimators=1000, random_state=0),\n                             max_samples=0.5, max_features=0.5)\nbagging.fit(tfidf_train,labels_nr_train)\n\n\nprint(\"Bagging TFIDF weighted\",f1_score(bagging.predict(tfidf_dev),labels_nr_dev, average=\"weighted\"))\nprint(\"Bagging TFIDF macro\",f1_score(bagging.predict(tfidf_dev),labels_nr_dev, average=\"macro\"))\n\nbagging = BaggingClassifier( RandomForestClassifier(max_depth=5, n_estimators=1000, random_state=0),\n                             max_samples=0.5, max_features=0.5)\n\nbagging.fit(scipy.sparse.hstack([tf_train,tfidf_train]).tocsr(),labels_nr_train)\n\n\nprint(\"Bagging TF+TFIDF weighted\",f1_score(bagging.predict(scipy.sparse.hstack([tf_dev,tfidf_dev]).tocsr()),labels_nr_dev, average=\"weighted\"))\nprint(\"Bagging TF+TFIDF macro\",f1_score(bagging.predict(scipy.sparse.hstack([tf_dev,tfidf_dev]).tocsr()),labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">RF TF weighted 0.4885197964113094\nRF TF macro 0.4221673549708851\nRF TF-IDF weighted 0.4893916483789577\nRF TF-IDF macro 0.4228223976071996\nBagging TF weighted 0.4789533772991431\nBagging TF macro 0.39480401932741443\nBagging TFIDF weighted 0.47137943835930063\nBagging TFIDF macro 0.3838371628473811\nBagging TF+TFIDF weighted 0.45362385515145237\nBagging TF+TFIDF macro 0.3778279814338985\n<\/pre>\n\n\n\n<p>Bagging and Random Forest (RF) sometimes are good, sometimes are bad; in this  case applying bagging decreases the SVMs\u2019 performance.<br> Let\u2019s stick with SVM for now. 
We will now try character-based n-grams, since our task, differentiating between the dialects, is mostly about spelling, which can be captured indirectly by a histogram of character n-grams.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">tf_vectorizer_char_ngram = CountVectorizer( min_df=2,\n                                               analyzer=\"char\",\n                                   max_features=n_features, ngram_range=(1, 7))\ntf_train_char_ngram = tf_vectorizer_char_ngram.fit_transform(sents_train)\ntf_dev_char_ngram = tf_vectorizer_char_ngram.transform(sents_dev)\n\nclf_svm.fit(tf_train_char_ngram,labels_nr_train)\npreds=clf_svm.predict(tf_dev_char_ngram)\nprint(\"SVM TF-IDF tf char ngram weighted\",f1_score(preds,labels_nr_dev, average=\"weighted\"))\nprint(\"SVM TF-IDF tf char ngram macro\",f1_score(preds,labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">SVM TF-IDF tf char ngram weighted 0.6558271945540242\nSVM TF-IDF tf char ngram macro 0.6468589036370523\n<\/pre>\n\n\n\n<p>Not bad, counting characters is about as good as TF-IDF with words. 
What about TF-IDF of characters?<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># additionally use normalized TF-IDF char n-grams\ntfidf_vectorizer_char_ngram = TfidfVectorizer( min_df=2,\n                                 norm=\"l2\",\n                                               analyzer=\"char\",\n                                   max_features=n_features, ngram_range=(1, 7))\ntfidf_train_char_ngram = tfidf_vectorizer_char_ngram.fit_transform(sents_train)\ntfidf_dev_char_ngram = tfidf_vectorizer_char_ngram.transform(sents_dev)\n\nclf_svm.fit(tfidf_train_char_ngram,labels_nr_train)\npreds=clf_svm.predict(tfidf_dev_char_ngram)\nprint(\"SVM TF-IDF tf char ngram weighted\",f1_score(preds,labels_nr_dev, average=\"weighted\"))\nprint(\"SVM TF-IDF tf char ngram macro\",f1_score(preds,labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">SVM TF-IDF tf char ngram weighted 0.6774906696759666\nSVM TF-IDF tf char ngram macro 0.6658621351916394\n<\/pre>\n\n\n\n<p>Wow, now we would win the 2017 competition! Here we also see that TF-IDF works well for both words and characters. But this was out of the box: can we create better bi-grams? And why bi-grams? Because we are looking for features that are phonetically similar and similarly structured, allowing not-quite-perfect matches. They also give us a kind of histogram over the phonemes. Furthermore, the number of \u201caa\u201ds and \u201cee\u201ds might give good hints. 
Let\u2019s look at why the tfidf_vectorizer with char n-grams could do even better:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">tfidf_vectorizer_char_ngram.get_feature_names()[:20]\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">[' ',\n ' a',\n ' a ',\n ' a d',\n ' a d ',\n ' a de',\n ' a de ',\n ' aa',\n ' aa ',\n ' aab',\n ' aabe',\n ' aaf',\n ' aafa',\n ' aafan',\n ' aafang',\n ' aag',\n ' aagf',\n ' aagfa',\n ' aagfan',\n ' aagl']<\/pre>\n\n\n\n<p>We can see lots of white spaces, which do not provide good insight into the structure of the words. However, silence is an important part of the music, so we will try without it, i.e. we remove the spaces. Also, until now we have examined only one feature type at a time, while the winner of last year\u2019s competition won with multiple features. Can we mix them somehow and get better results?<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">def gather_bigrams(data):\n    res = set()\n    for n1 in data:\n        res.update(bigrams(n1))\n    return list(res)\n\ndef bigrams(word):\n    chars = [c for c in word]\n    bigrams = [c1 + c2 for c1, c2 in zip(chars, chars[1:])]\n    features = chars + bigrams\n    return features\n\ndef transform_features(data_train, data_test, n_grams=1):\n\n    bigrams_list = gather_bigrams([tj for tk in data_train for tj in tk.split() if tj.find(\" \")==-1])\n    #note: ngram_range has no effect here, since a callable analyzer\n    #bypasses CountVectorizer's own n-gram extraction\n    cv = CountVectorizer(\n            analyzer=bigrams,\n            preprocessor=lambda x : x,\n            vocabulary=bigrams_list,\n        ngram_range=(1, n_grams))\n\n    \n    X_train = cv.fit_transform(data_train)\n\n    X_test = cv.transform(data_test)\n    return X_train, X_test\n\nX_train, X_test = transform_features(sents_train,sents_dev)\n\nclf_svm.fit(X_train,labels_nr_train)\npreds=clf_svm.predict(X_test)\nprint(\"SVM CV + Bigrams weighted\",f1_score(preds,labels_nr_dev, average=\"weighted\"))\nprint(\"SVM CV + Bigrams macro\",f1_score(preds,labels_nr_dev, 
average=\"macro\"))\nclf_svm.fit(scipy.sparse.hstack([tfidf_train,X_train]),labels_nr_train)\npreds=clf_svm.predict(scipy.sparse.hstack([tfidf_dev.todense(),X_test]))\nprint(\"SVM CV + Bigrams weighted TF-IDF+Bigrams\",f1_score(preds,labels_nr_dev, average=\"weighted\"))\nprint(\"SVM CV + Bigrams macro TF-IDF+Bigrams\",f1_score(preds,labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">SVM CV + Bigrams weighted 0.5910983725307404\nSVM CV + Bigrams macro 0.5789372643322429\nSVM CV + Bigrams weighted TF-IDF+Bigrams 0.6773374236772861\nSVM CV + Bigrams macro TF-IDF+Bigrams 0.6700263168459002\n<\/pre>\n\n\n\n<p>Although  by itself, this simple bi-gram analyzer performs poorly, in combination  with TF-IDF, it increased the macro F-1 score considerably. The 0.005 increase compared to TF-IDF alone is what makes the biggest difference between the final scores. So, combining single features might be a good approach.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Merging Features<\/h2>\n\n\n\n<p>Let\u2019s try to answer the following question: What is the impact of using n-grams instead of words for TF-IDF (while  keeping the character bi-grams)?<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">tfidf_vectorizer_ngram = TfidfVectorizer( min_df=2,\n                                 norm=\"l2\",\n                                   max_features=n_features, ngram_range=(1, 7))\ntfidf_train_ngram = tfidf_vectorizer_ngram.fit_transform(sents_train)\ntfidf_dev_ngram = tfidf_vectorizer_ngram.transform(sents_dev)\n \nclf_svm.fit(scipy.sparse.hstack([tfidf_train_ngram,X_train]),labels_nr_train)\npreds=clf_svm.predict(scipy.sparse.hstack([tfidf_dev_ngram,X_test]))\nprint(\"SVM TF-IDF + Bigrams + word 7 ngram 5\",f1_score(preds,labels_nr_dev, average=\"weighted\"))\nprint(\"SVM TF-IDF + Bigrams + word 7 ngram macro\",f1_score(preds,labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">SVM TF-IDF + Bigrams + word 7 ngram 
weighted 0.678517820542825\nSVM TF-IDF + Bigrams + word 7 ngram macro 0.6686880145927554\n<\/pre>\n\n\n\n<p>Apparently a slightly lower macro and a slightly higher weighted F-1 score. What about using custom n-grams?<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">X_train_ngrams, X_test_ngrams = transform_features(sents_train,sents_dev, n_grams=7)\n\nclf_svm.fit(scipy.sparse.hstack([tfidf_train_ngram,X_train_ngrams]),labels_nr_train)\npreds=clf_svm.predict(scipy.sparse.hstack([tfidf_dev_ngram,X_test_ngrams]))\nprint(\"SVM TF-IDF + Bigrams  + char 7 ngram weighted\",f1_score(preds,labels_nr_dev, average=\"weighted\"))\nprint(\"SVM TF-IDF + Bigrams  + char 7 ngram macro\",f1_score(preds,labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">SVM TF-IDF + Bigrams  + char 7 ngram weighted 0.678517820542825\nSVM TF-IDF + Bigrams  + char 7 ngram macro 0.6686880145927554\n<\/pre>\n\n\n\n<p>It looks like nothing changed. Indeed, CountVectorizer ignores ngram_range when a callable analyzer is passed, so the features are exactly the same as before. Let\u2019s try to put even more features together.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">clf_svm.fit(scipy.sparse.hstack([tfidf_train_ngram,X_train_ngrams,tfidf_train_char_ngram]),labels_nr_train)\npreds=clf_svm.predict(scipy.sparse.hstack([tfidf_dev_ngram,X_test_ngrams,tfidf_dev_char_ngram]))\nprint(\"SVM TF-IDF + Bigrams  + char 7 ngram (bigrams) tfidf chars ngrams + tf char ngram weighted\",f1_score(preds,labels_nr_dev, average=\"weighted\"))\nprint(\"SVM TF-IDF + Bigrams  + char 7 ngram (bigrams) tfidf chars ngrams + tf char ngram macro\",f1_score(preds,labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">SVM TF-IDF + Bigrams  + char 7 ngram (bigrams) tfidf chars ngrams + tf char ngram weighted 0.6807815919337346\nSVM TF-IDF + Bigrams  + char 7 ngram (bigrams) tfidf chars ngrams + tf char ngram macro 0.6708978004436262\n<\/pre>\n\n\n\n<p>Nice, slightly more for both scores.<br>\nNow, let\u2019s try normalizing the tf and using the character n-gram 
count.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">from sklearn.preprocessing import normalize\n\ntf_vectorizer_char_ngram = CountVectorizer( min_df=2,\n                                               analyzer=\"char\",\n                                   max_features=n_features, ngram_range=(1, 7))\ntf_train_char_ngram = tf_vectorizer_char_ngram.fit_transform(sents_train)\ntf_dev_char_ngram = tf_vectorizer_char_ngram.transform(sents_dev)\n\ntf_train_char_ngram = normalize(tf_train_char_ngram)\ntf_dev_char_ngram = normalize(tf_dev_char_ngram)\n\n\nclf_svm.fit(scipy.sparse.hstack([tfidf_train_ngram,X_train_ngrams,tfidf_train_char_ngram,tf_train_char_ngram]),labels_nr_train)\npreds=clf_svm.predict(scipy.sparse.hstack([tfidf_dev_ngram,X_test_ngrams,tfidf_dev_char_ngram,tf_dev_char_ngram]))\nprint(\"SVM TF-IDF + Bigrams  + char 7 ngram (bigrams) tfidf chars ngrams + tf char ngram weighted\",f1_score(preds,labels_nr_dev, average=\"weighted\"))\nprint(\"SVM TF-IDF + Bigrams  + char 7 ngram (bigrams) tfidf chars ngrams + tf char ngram macro\",f1_score(preds,labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">SVM TF-IDF + Bigrams  + char 7 ngram (bigrams) tfidf chars ngrams + tf char ngram weighted 0.6817307419317699\nSVM TF-IDF + Bigrams  + char 7 ngram (bigrams) tfidf chars ngrams + tf char ngram macro 0.6725325917465754\n<\/pre>\n\n\n\n<p>Ok, not much of a difference, but still an increase.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Merging Predictions<\/h2>\n\n\n\n<p>Until now we simply concatenated the features, but they are on quite different scales, which can cause the SVM to give more importance to features with higher values. A possible counter-action would be to normalize them. 
Yet, since we have many different features and many possible ways to normalize (over the training set, per feature, per sample, linear versus non-linear), we will instead train a separate SVM per feature set and combine them by majority voting (an ensemble).<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">svms=[]\nfor tk in [tfidf_train_ngram,X_train_ngrams,tfidf_train_char_ngram,tf_train_char_ngram]:\n    svms.append(LinearSVC(random_state=0, C=1))\n    svms[-1].fit(tk,labels_nr_train)\n\nsvms_preds=[]\nfor i1,tk in enumerate([tfidf_dev_ngram,X_test_ngrams,tfidf_dev_char_ngram,tf_dev_char_ngram]):\n    svms_preds.append(svms[i1].predict(tk))\n\n\n\nsumpres=[Counter(np.array(svms_preds)[:,tk]).most_common()[0][0] for tk in range(len(svms_preds[0]))]\nprint(\"SVM TF-IDF separate svm for features Bigrams   char 7 ngram weighted\",f1_score(sumpres,labels_nr_dev, average=\"weighted\"))\nprint(\"SVM TF-IDF separate svm for features Bigrams   char 7 ngram macro\",f1_score(sumpres,labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">SVM TF-IDF separate svm for features Bigrams   char 7 ngram weighted 0.6894441553644064\nSVM TF-IDF separate svm for features Bigrams   char 7 ngram macro 0.6758401343396775\n<\/pre>\n\n\n\n<p>Cool, not bad, almost one more point in weighted F-1. But the vote is unweighted: we give the same importance to each feature SVM, treating low- and high-confidence predictions equally. 
What if we use the scores of the SVMs (decision_function) for the weighting?<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">svms=[]\nfor tk in [tf_train, tfidf_train, tfidf_train_ngram,X_train_ngrams,tfidf_train_char_ngram,tf_train_char_ngram]:\n    svms.append(LinearSVC(random_state=0, C=1))\n    svms[-1].fit(tk,labels_nr_train)\n\nsvms_preds=[]\nfor i1,tk in enumerate([tf_dev, tfidf_dev,tfidf_dev_ngram,X_test_ngrams,tfidf_dev_char_ngram,tf_dev_char_ngram]):\n    svms_preds.append(svms[i1].decision_function(tk))\n\nsumpres=sum(svms_preds)\nprint(\"SVM TF-IDF separate svm for features Bigrams   char 7 ngram weighted\",f1_score(np.argmax(sumpres,1),labels_nr_dev, average=\"weighted\"))\nprint(\"SVM TF-IDF separate svm for features Bigrams   char 7 ngram macro\",f1_score(np.argmax(sumpres,1),labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">SVM TF-IDF separate svm for features Bigrams   char 7 ngram weighted 0.6970817681702751\nSVM TF-IDF separate svm for features Bigrams   char 7 ngram macro 0.687829000954753\n<\/pre>\n\n\n\n<p>Much better, almost 0.70. We would have easily beaten the winner of last year\u2019s competition.<br>Still, a plain sum seems crude\u2026 can we build a meta-classifier?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Meta-Classifier<\/h2>\n\n\n\n<p>A meta-classifier could learn how to combine the SVMs. 
Also, most importantly, each SVM can specialize in its own feature space, so any type of normalization can be avoided.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">svms=[]\nfor tk in [tfidf_train_ngram,X_train_ngrams,tfidf_train_char_ngram,tf_train_char_ngram]:\n    svms.append(LinearSVC(random_state=0, C=1))\n    svms[-1].fit(tk,labels_nr_train)\n\nsvms_train_meta=[]\nfor i1,tk in enumerate([tfidf_train_ngram,X_train_ngrams,tfidf_train_char_ngram,tf_train_char_ngram]):\n    svms_train_meta.append(svms[i1].decision_function(tk))\n\nsvm_meta=LinearSVC(random_state=0, C=0.75)\nsvm_meta.fit(scipy.sparse.hstack([np.concatenate(svms_train_meta,1),tfidf_train_ngram,X_train_ngrams,tfidf_train_char_ngram,tf_train_char_ngram]),labels_nr_train)\n\n\n\nsvms_preds=[]\nfor i1,tk in enumerate([tfidf_dev_ngram,X_test_ngrams,tfidf_dev_char_ngram,tf_dev_char_ngram]):\n    svms_preds.append(svms[i1].decision_function(tk))\nsvm_meta_preds=svm_meta.decision_function(scipy.sparse.hstack([np.concatenate(svms_preds,1),tfidf_dev_ngram,X_test_ngrams,tfidf_dev_char_ngram,tf_dev_char_ngram]))\n\n\nprint(\"Meta SVM TF-IDF separate svm for features Bigrams   char 7 ngram macro\",f1_score(np.argmax(svm_meta_preds,1),labels_nr_dev, average=\"macro\"))\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">Meta SVM TF-IDF separate svm for features Bigrams   char 7 ngram macro 0.6678147793590107\n<\/pre>\n\n\n\n<p>That was not good, probably because the base classifiers perform too well on the training set, so their training-set predictions are overconfident. The solution is cross-validation: slice the training data into e.g. 5 slices, use one slice for prediction and the others for training, and iterate, switching the held-out slice every time (5 times in total). 
We make predictions on the held-out slice; gathering all these predictions creates a prediction set on which we can train a meta-classifier.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><a href=\"http:\/\/www.spinningbytes.com\/wp-content\/uploads\/2018\/06\/meta-cv.png\"><img decoding=\"async\" src=\"http:\/\/www.spinningbytes.com\/wp-content\/uploads\/2018\/06\/meta-cv-300x277.png\" alt=\"\" class=\"wp-image-3593\" \/><\/a><\/figure>\n\n\n\n<p>We did not refactor the code, so it can be quite long to go through, but we hope the main idea is clear. We use the helper function perform (shown further below) from meta_cv (we will go into this in more detail in another blog post). Let\u2019s see the results:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import meta_cv\nmeta_cv.perform(0)\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">uall 0 comp 0 tf 0\nMeta weighted 0.7010140759532392\nMeta macro 0.6906810078254101\n<\/pre>\n\n\n\n<p>Nice, now we are set to compete (i.e. to apply the model to the real test set).<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">meta_cv.perform(1)\n<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">uall 0 comp 1 tf 0\nMeta weighted 0.6541038424460858\nMeta macro 0.6463870277698942\n<\/pre>\n\n\n\n<p>Ok, with that result we got second place, without any parameter tuning. 
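The cross-validated stacking scheme described above can also be sketched in a few lines with scikit-learn's cross_val_predict. This is a simplified stand-in for the meta_cv module, not the actual implementation, and it is run here on synthetic data rather than the GDI corpus:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

# synthetic 4-class problem standing in for the dialect data
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=4, random_state=0)

# out-of-fold decision scores: each sample is scored by a model
# that never saw it during training (5 folds, as in the post)
meta_features = cross_val_predict(LinearSVC(random_state=0), X, y,
                                  cv=5, method="decision_function")

# the meta-classifier is then trained on these unbiased scores
meta_clf = LinearSVC(random_state=0).fit(meta_features, y)
print(meta_features.shape)  # (200, 4): one score per class
```

In the real pipeline, one such block of out-of-fold scores would be produced per feature set and concatenated before fitting the meta-classifier.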
In the next blog post, we will discuss how to analyze our results and further improve our parameters.<\/p>\n\n\n\n<p>Below is the perform helper function:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><strong>import<\/strong> os\n<strong>from<\/strong> meta_cv <strong>import<\/strong> MetaCV\n\n#download gold labels\ngold_dir=\".\/data\/gold\/\"\n<strong>if not<\/strong> os.path.exists(gold_dir):\n    os.makedirs(gold_dir)\ntarget_path_gold=download_file(\"https:\/\/drive.google.com\/uc?authuser=0&amp;id=0B8I6bgWbt_MfWHJrNDhpOFVVcUhabGFxWDJUWG9LR2hGODFJ&amp;export=download\", target_path=gold_dir+\"testdata_fixed.zip\")\nextract_zip( target_path_gold, gold_dir)\n\n<strong>def<\/strong> perform(comp):\n    meta=MetaCV(splits=5)\n    <strong>print<\/strong> (\"comp\",comp)\n    labels_train, labels_nr_train, labels_dict_train,sents_train_raw, words_train, chars_train, char_counts_train = get_data(all=0)\n    labels_dev_dev, labels_nr_dev, labels_dict_dev,sents_dev_raw, words_dev, chars_dev, char_counts_dev = get_data(fname=os.path.join(os.path.dirname(os.path.realpath(__file__)),\".\/data\/dev.txt\"), all=0)\n    sents_train= [\" \".join(tk).lower() <strong>for<\/strong> tk <strong>in<\/strong> sents_train_raw]\n    sents_dev= [\" \".join(tk).lower() <strong>for<\/strong> tk <strong>in<\/strong> sents_dev_raw]\n\n    labels_test, labels_nr_test, labels_dict_test,sents_test_raw, words_test, chars_test, char_counts_test = get_data(fname=os.path.join(os.path.dirname(os.path.realpath(__file__)),\".\/data\/gold\/gold.txt\"))\n\n    <strong>if<\/strong> comp==0:\n        meta.fit(sents_train,labels_nr_train)\n        preds=meta.predict(sents_dev)\n        
<strong>print<\/strong>(\"SVM TF\", f1_score(preds, labels_nr_dev, average=\"weighted\"))\n        <strong>print<\/strong>(\"SVM TF macro\", f1_score(preds, labels_nr_dev, average=\"macro\"))\n    <strong>else<\/strong>:\n        # competition mode: train on train + dev, predict the real test set\n        rev_labels_dict_train = dict([(tv, tk) <strong>for<\/strong> tk, tv <strong>in<\/strong> labels_dict_train.items()])\n        sents_train_comp = [\" \".join(tk).lower() <strong>for<\/strong> tj <strong>in<\/strong> [sents_train_raw, sents_dev_raw] <strong>for<\/strong> tk <strong>in<\/strong> tj]\n        labels_nr_train_comp = [tk <strong>for<\/strong> tj <strong>in<\/strong> [labels_nr_train, labels_nr_dev] <strong>for<\/strong> tk <strong>in<\/strong> tj]\n        meta.fit(sents_train_comp, labels_nr_train_comp)\n        sents_test_comp = [\" \".join(tk).lower() <strong>for<\/strong> tk <strong>in<\/strong> sents_test_raw]\n        preds = meta.predict(sents_test_comp)\n        # map numeric predictions back to canton labels and write the submission file\n        labels_preds = [rev_labels_dict_train[np.argmax(tk)] <strong>for<\/strong> tk <strong>in<\/strong> preds]\n        <strong>with<\/strong> open(\"prediction.labels\", \"w\") <strong>as<\/strong> f1:\n            <strong>for<\/strong> label <strong>in<\/strong> labels_preds:\n                f1.write(label + \"\\n\")\n\n        # a second submission based on the decision scores of the meta classifier\n        preds_score = meta.decision_function(sents_test_comp)\n        trf = np.argmax(preds_score, 1)\n        rev_labels_dict_train[-1] = \"XY\"  # XY marks the unknown (surprise) dialect\n        tr_labs = [rev_labels_dict_train[trf[tk]] <strong>for<\/strong> tk <strong>in<\/strong> range(trf.shape[0])]\n        <strong>with<\/strong> open(\"predictions_c5_metacv_multiclass_threshold.labels\", \"w\") <strong>as<\/strong> f1:\n            <strong>for<\/strong> tr1 <strong>in<\/strong> tr_labs:\n                f1.write(tr1 + \"\\n\")\n\n        
# restrict the evaluation to the four GDI cantons (ZH, LU, BE, BS)\n        gdi4 = np.isin(labels_nr_test, [labels_dict_test[tk] <strong>for<\/strong> tk <strong>in<\/strong> [\"ZH\", \"LU\", \"BE\", \"BS\"]])\n        index = np.where(gdi4)[0]\n\n        <strong>print<\/strong>(\"SVM TF\", f1_score([labels_dict_test[rev_labels_dict_train[tk]] <strong>for<\/strong> tk <strong>in<\/strong> np.array(trf)[index]], np.array(labels_nr_test)[index], average=\"weighted\"))\n        <strong>print<\/strong>(\"SVM TF macro\", f1_score([labels_dict_test[rev_labels_dict_train[tk]] <strong>for<\/strong> tk <strong>in<\/strong> np.array(trf)[index]], np.array(labels_nr_test)[index], average=\"macro\"))\n\n        <strong>return<\/strong> index, gdi4, trf, labels_dict_test, rev_labels_dict_train, labels_nr_test, preds_score\n<\/pre>\n<div class=\"pt-sm\">Tags: <a href=\"https:\/\/blog.zhaw.ch\/datascience\/tag\/nlp\/\">NLP<\/a>, <a href=\"https:\/\/blog.zhaw.ch\/datascience\/tag\/programming\/\">Programming<\/a>, <a href=\"https:\/\/blog.zhaw.ch\/datascience\/tag\/python\/\">Python<\/a>, <a href=\"https:\/\/blog.zhaw.ch\/datascience\/tag\/research\/\">Research<\/a>, <a href=\"https:\/\/blog.zhaw.ch\/datascience\/tag\/tutorial\/\">Tutorial<\/a><br><\/div>","protected":false}}