Avito Demand Prediction Kaggle competition

Check the code on GitHub.

Avito launched a competition on Kaggle challenging participants to predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (where it was posted geographically, similar ads already posted) and historical demand for similar ads in similar contexts.

Introduction

Ad demand is a promising topic, and this competition attracted me because it offered the opportunity to combine several feature types.

As a starting point, I read the published kernels and some papers. Dimitri's ad click prediction paper stood out: it describes in detail how to predict the probability that a user will click on an ad, or be attracted to it, based on the ad's thumbnail.

The paper introduces several image features. These features were then implemented using OpenCV and added to a baseline LightGBM model.

Image Features

Text Features

A count vectorizer was applied to the title and to the description, along with word counts for both (the assumption being that users are attracted to short, straight-to-the-point descriptions).

A readability index, the Flesch reading ease score (designed for English), was also extracted. It combines two quantities: the average sentence length, which is the lexicon (word) count divided by the number of sentences found in the input description based on punctuation, and the average number of syllables per word, which is counted using the Pyphen dictionary package.
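As a rough illustration of these text features, the sketch below builds bag-of-words counts and a word-count feature in plain Python. It is only a stand-in for a real count vectorizer (for example scikit-learn's `CountVectorizer`, whose default tokenization differs); the function names `tokenize`, `fit_vocabulary`, `transform` and `word_count` and the sample titles are hypothetical and not taken from the competition code.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on word characters; \w is Unicode-aware,
    # so Cyrillic text is tokenized as well.
    return re.findall(r"\w+", text.lower())


def fit_vocabulary(docs, min_df=1):
    # Keep terms that appear in at least `min_df` documents,
    # and map each one to a column index (alphabetical order).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    terms = sorted(t for t, n in doc_freq.items() if n >= min_df)
    return {term: i for i, term in enumerate(terms)}


def transform(docs, vocab):
    # One row per document, one column per vocabulary term.
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in tokenize(doc):
            if token in vocab:
                row[vocab[token]] += 1
        rows.append(row)
    return rows


def word_count(text):
    # Simple length feature: number of tokens in the text.
    return len(tokenize(text))


# Hypothetical ad titles, for illustration only.
titles = ["red bike", "bike for sale", "red red car"]
vocab = fit_vocabulary(titles)
counts = transform(titles, vocab)
lengths = [word_count(t) for t in titles]
```

The count matrix and the length column can then be stacked next to the other features before training the model.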


def flesch_reading_ease(text):
    ASL = avg_sentence_length(text)  # lexicon_count / number of sentences
    ASW = avg_syllables_per_word(text)  # number of syllables, verified by Pyphen
    FRE = 206.835 - float(1.015 * ASL) - float(84.6 * ASW)
    return legacy_round(FRE, 2)

This feature, together with the count vectorizer features for the title and description, improved the model. The difficult part of working with the input text was the language barrier (Russian): I was unable to verify the correctness of the readability index or to inspect the text for patterns that could yield more features.

Feature Selection

I used the BorutaPy package to select from all the previously extracted features (image and text features).
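The idea behind Boruta is to compare each real feature against shuffled "shadow" copies of the features and keep only those that are consistently more important than the best shadow. The sketch below illustrates that idea in plain Python under strong simplifications: absolute Pearson correlation stands in for the random forest importance that Boruta uses, and a simple majority vote replaces its statistical test. It is not the BorutaPy API; `abs_corr`, `shadow_select` and the toy data are hypothetical.

```python
import random


def abs_corr(xs, ys):
    # Absolute Pearson correlation; 0.0 if either column is constant.
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def shadow_select(features, target, n_iter=20, seed=0):
    # features: dict mapping feature name -> list of values.
    # Returns the names that beat the best shadow in most rounds.
    rng = random.Random(seed)
    importance = {name: abs_corr(col, target) for name, col in features.items()}
    hits = dict.fromkeys(features, 0)
    for _ in range(n_iter):
        best_shadow = 0.0
        for col in features.values():
            shadow = list(col)
            rng.shuffle(shadow)  # shuffling breaks any link with the target
            best_shadow = max(best_shadow, abs_corr(shadow, target))
        for name in features:
            if importance[name] > best_shadow:
                hits[name] += 1
    return [name for name in features if hits[name] > n_iter / 2]


# Toy data: one feature tied to the target, one uncorrelated with it.
target = list(range(201))
features = {
    "informative": [2 * t + 1 for t in target],
    "symmetric_noise": [(t - 100) ** 2 for t in target],
}
selected = shadow_select(features, target)
```

In the toy data only the feature that is linearly tied to the target survives the comparison against the shadows.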

References