Avito Demand Prediction Kaggle competition

Competitions, Projects · 16 Aug 2018 - 3 minutes to read.

Avito launched a competition on Kaggle challenging users to predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (where it was posted geographically, similar ads already posted) and historical demand for similar ads in similar contexts.

Introduction

Ad demand is a promising topic, and this competition attracted me because it offered the opportunity to combine several feature types.

As a starting point, I read the published kernels and some papers. The Dimitri Ad Clicking prediction paper stood out as a detailed and appealing one: it predicts the probability that a user will click on an ad, or be attracted to it, based on the ad's thumbnail.

The paper introduces several image features; these were then implemented using OpenCV and added to a baseline LightGBM model.

Image Features

calculate image simplicity => calculates the simplicity of the input image
image basic segment stats => extracts basic image segmentation statistics (tuple of 10 features)
image face feats => counts the faces in the input image using a pretrained Haar cascade from OpenCV
image sift feats => number of SIFT keypoints extracted from the input image
image rgb simplicity => simplicity feature computed on the RGB image
image hsv simplicity => simplicity features computed on the HSV image
image hue histogram => features from the histogram of the HSV image
image grayscale simplicity => simplicity features computed on the grayscale image
image sharpness => image sharpness score
image contrast => image contrast score
image saturation => image saturation score
image brightness => image brightness score
image colorfulness => colorfulness score based on the paper
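
As a rough illustration of the simpler colour statistics in the list above, here is a minimal pure-Python sketch of brightness, saturation and contrast scores. The function name `basic_color_stats` and the definitions it uses (mean HSV value, mean HSV saturation, standard deviation of a grayscale luma) are assumptions made for illustration; the paper and the OpenCV implementation may define these scores differently, and on real images one would work on arrays loaded with OpenCV rather than on Python lists.

```python
import colorsys
import statistics


def basic_color_stats(pixels):
    """Return brightness, saturation and contrast for a list of (R, G, B) pixels.

    Channel values are expected in the 0-255 range. The definitions used
    here are illustrative assumptions, not the paper's exact formulas.
    """
    values = []       # HSV "value" channel per pixel, in [0, 1]
    saturations = []  # HSV "saturation" channel per pixel, in [0, 1]
    grays = []        # grayscale luma per pixel, in [0, 1]
    for r, g, b in pixels:
        r, g, b = r / 255.0, g / 255.0, b / 255.0
        _, s, v = colorsys.rgb_to_hsv(r, g, b)
        values.append(v)
        saturations.append(s)
        # ITU-R BT.601 luma weights, the same ones OpenCV uses for RGB -> gray
        grays.append(0.299 * r + 0.587 * g + 0.114 * b)
    return {
        "brightness": statistics.fmean(values),
        "saturation": statistics.fmean(saturations),
        "contrast": statistics.pstdev(grays),
    }


# A tiny "image" flattened to a list: black, white, pure red, mid gray
image = [(0, 0, 0), (255, 255, 255), (255, 0, 0), (128, 128, 128)]
stats = basic_color_stats(image)
print(stats)
```

Each score ends up in the 0 to 1 range, which makes the values easy to append as extra columns next to the other model features.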

Text Features

A count vectorizer was applied to the title and to the description, along with word counts for both (users are attracted to short, straight-to-the-point descriptions).
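
To make these text features concrete, the sketch below shows what a count vectorizer computes, using only the standard library. It is not the scikit-learn `CountVectorizer` itself: `tokenize`, `build_vocabulary` and `count_vector` are hypothetical helpers, the tokenizer is a simple assumption, and the sample titles are made up.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"\w+")  # \w is Unicode-aware, so Cyrillic words match too


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(texts):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(text, vocabulary):
    """Return per-token counts for one text as a dense list."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens unseen at fit time are ignored
            vector[vocabulary[tok]] = n
    return vector


titles = ["red bike for sale", "bike bike helmet"]  # made-up sample titles
vocab = build_vocabulary(titles)
vectors = [count_vector(t, vocab) for t in titles]
word_counts = [len(tokenize(t)) for t in titles]  # the extra word-count feature
print(vocab)
print(vectors, word_counts)
```

In practice the count vectors are sparse and high-dimensional, which is why a library implementation that returns sparse matrices is preferable to dense lists like these.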

A readability feature, the Flesch Reading Ease index (designed for English), was also extracted. It combines the average sentence length, i.e. the lexicon count divided by the number of sentences found in the description based on punctuation, with the average number of syllables per word, which is computed with the Pyphen dictionary package.

def flesch_reading_ease(text):
    ASL = avg_sentence_length(text)  # lexicon_count / number of sentences
    ASW = avg_syllables_per_word(text)  # syllable counts verified by Pyphen
    FRE = 206.835 - 1.015 * ASL - 84.6 * ASW
    return legacy_round(FRE, 2)

This feature, along with the count vectorizer features for the title and description, improved the model. The difficult part of working with the input text was the language barrier (Russian): I was unable to verify the correctness of the readability index or to inspect text patterns for additional features.
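
The readability formula above can be run end to end with the self-contained sketch below. The helpers are simplified stand-ins, not the ones used in the competition: sentences are split on `.`, `!` and `?`, and syllables are approximated by counting runs of consecutive English vowels instead of using Pyphen, so the scores will differ from those of the original feature. The name `flesch_reading_ease_sketch` is hypothetical.

```python
import re

VOWELS = "aeiouy"  # assumption: English vowels only


def count_syllables(word):
    """Approximate syllables as runs of consecutive vowels (at least 1)."""
    groups = re.findall(f"[{VOWELS}]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease_sketch(text):
    """Flesch Reading Ease with naive sentence and syllable counting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0  # assumption: score empty text as 0 rather than failing
    asl = len(words) / len(sentences)  # average sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return round(206.835 - 1.015 * asl - 84.6 * asw, 2)


easy = flesch_reading_ease_sketch("The cat sat on the mat. It was happy.")
hard = flesch_reading_ease_sketch(
    "Notwithstanding considerable organizational complexity, "
    "implementation proceeded satisfactorily."
)
print(easy, hard)
```

Short sentences with short words score higher than long, polysyllabic ones, which is the property the feature relies on. Because the word pattern and vowel list here only cover Latin letters, this sketch would need a different tokenizer and syllable counter before it could say anything about Russian text.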

Feature Selection

The BorutaPy package was used to select from all the previously extracted features (image and text features).
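
The idea behind Boruta is to compare each real feature against shuffled "shadow" copies that cannot carry any signal. The sketch below illustrates only that idea in plain Python and does not call BorutaPy: importance is approximated by the absolute Pearson correlation with the target instead of random-forest importances, a feature is kept only if it beats the best shadow in every trial instead of passing a statistical test, and the names `pearson` and `shadow_feature_selection` as well as the toy data are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def shadow_feature_selection(features, target, n_trials=20, seed=0):
    """Keep features that beat the best shuffled 'shadow' in every trial.

    `features` maps a feature name to a list of values. Importance is
    |Pearson correlation| with the target, a stand-in for the
    random-forest importances that Boruta relies on.
    """
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_trials):
        # Shadows keep each feature's distribution but destroy its
        # relationship with the target.
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(abs(pearson(shadow, target)))
        best_shadow = max(shadow_scores)
        for name, values in features.items():
            if abs(pearson(values, target)) > best_shadow:
                hits[name] += 1
    return [name for name, h in hits.items() if h == n_trials]


# Toy data: one feature that tracks the target and one that is pure noise
data_rng = random.Random(42)
target = [float(i) for i in range(60)]
features = {
    "signal": [2.0 * t + data_rng.gauss(0, 1) for t in target],
    "noise": [data_rng.random() for _ in target],
}
selected = shadow_feature_selection(features, target)
print(selected)
```

A feature that tracks the target beats every shadow, while a pure-noise feature has to get lucky in all trials to survive, which is what makes the shadow comparison a usable filter.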

References

Code for the features extracted from input images: Image Extraction for ad clicking
https://simple.wikipedia.org/wiki/Flesch_Reading_Ease