Avito Demand Prediction Kaggle competition
Avito launched a competition on Kaggle challenging users to predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (where it was posted geographically, similar ads already posted) and historical demand for similar ads in similar contexts.
Introduction
Ad demand is a promising topic, and this competition attracted me because it offered the opportunity to combine several feature types.
As a starting point, I read the published kernels and some papers. The Dimitri Ad Clicking prediction paper stood out: it describes in detail how to predict the probability that a user will click on an ad, or be attracted to it, based on the ad's thumbnail.
The paper introduces a number of image features. These features were then implemented using OpenCV and added to a baseline LightGBM model.
Image Features
- calculate image simplicity => used to calculate the simplicity of the input image
- image basic segment stats => used to extract basic image segmentation statistics (tuple of 10 features)
- image face feats => used to extract the number of faces in the input image using a pretrained Haar cascade from OpenCV
- image sift feats => number of SIFT keypoints extracted from the input image
- image rgb simplicity => simplicity feature from the RGB image
- image hsv simplicity => simplicity features from the HSV image
- image hue histogram => image features from the histogram of the HSV image
- image grayscale simplicity => simplicity features from the grayscale image
- image sharpness => used to calculate the image sharpness score
- image contrast => used to calculate the image contrast score
- image saturation => used to calculate the image saturation
- image brightness => used to calculate the image brightness score
- image colorfulness => used to calculate the colorfulness score based on the paper
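As a rough illustration of what a few of the global scores in this list might look like, here is a minimal sketch. It uses NumPy only and expects an array in the uint8, BGR layout that OpenCV's imread returns. The function name basic_image_scores and every formula in it are assumptions made for illustration, not the code used in the competition: sharpness is taken as the variance of a 4-neighbour Laplacian, and colorfulness follows the commonly used Hasler and Süsstrunk metric, which may differ from the definition in the paper.

```python
import numpy as np


def basic_image_scores(bgr):
    """Return a few global scores for a uint8 BGR image of shape (H, W, 3).

    Hypothetical helper for illustration; H and W must be at least 3.
    """
    img = bgr.astype(np.float64)
    b, g, r = img[..., 0], img[..., 1], img[..., 2]

    # Brightness and contrast: mean and standard deviation of a luma
    # approximation (BT.601 weights), scaled to the range [0, 1].
    gray = (0.299 * r + 0.587 * g + 0.114 * b) / 255.0
    brightness = gray.mean()
    contrast = gray.std()

    # Saturation as in the HSV model: (max - min) / max per pixel,
    # defined as 0 for pure black pixels.
    cmax = img.max(axis=2)
    cmin = img.min(axis=2)
    sat = np.where(cmax > 0, (cmax - cmin) / np.maximum(cmax, 1.0), 0.0)

    # Sharpness: variance of a 4-neighbour Laplacian over interior pixels.
    lap = (
        gray[:-2, 1:-1] + gray[2:, 1:-1]
        + gray[1:-1, :-2] + gray[1:-1, 2:]
        - 4.0 * gray[1:-1, 1:-1]
    )

    # Colorfulness (Hasler and Suesstrunk style): statistics of the
    # red-green and yellow-blue opponent channels.
    rg = r - g
    yb = 0.5 * (r + g) - b
    std_root = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mean_root = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)

    return {
        "brightness": float(brightness),
        "contrast": float(contrast),
        "saturation": float(sat.mean()),
        "sharpness": float(lap.var()),
        "colorfulness": float(std_root + 0.3 * mean_root),
    }
```

Each score is a single float per image, so the resulting dictionary can be appended as extra columns to the tabular features passed to a gradient boosting model.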
Text Features
I applied a count vectorizer to the title and to the description, and added the word count of each as a feature, on the assumption that users are attracted to short, straight-to-the-point descriptions.
A readability feature, the Flesch reading ease index (which was designed for English), was also extracted. It combines two quantities: the average sentence length, which is the lexicon count divided by the number of sentences found in the input description based on punctuation, and the average number of syllables per word, counted with the Pyphen dictionary package.
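A minimal sketch of how such text features could be assembled with scikit-learn's CountVectorizer is shown here. The function name build_text_features, the max_features value and the whitespace-based word count are assumptions for illustration; the settings actually used in the competition are not stated.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer


def build_text_features(titles, descriptions, max_features=1000):
    """Stack bag-of-words counts with per-ad word counts (hypothetical helper)."""
    # Separate vocabularies for titles and descriptions.
    title_vec = CountVectorizer(max_features=max_features)
    desc_vec = CountVectorizer(max_features=max_features)
    title_bow = title_vec.fit_transform(titles)
    desc_bow = desc_vec.fit_transform(descriptions)

    # Word counts of the title and description, split on whitespace.
    counts = np.array(
        [[len(t.split()), len(d.split())] for t, d in zip(titles, descriptions)],
        dtype=np.float64,
    )

    # Keep everything sparse so the result scales to large vocabularies.
    features = hstack([title_bow, desc_bow, csr_matrix(counts)]).tocsr()
    return features, title_vec, desc_vec


titles = ["red bike", "blue bike for sale"]
descriptions = ["almost new red bike", "cheap"]
features, title_vec, desc_vec = build_text_features(titles, descriptions)
```

The fitted vectorizers are returned so that the same vocabularies can be applied to the test set with transform rather than being refitted.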
def flesch_reading_ease(text):
    # The helper functions called below are defined elsewhere.
    ASL = avg_sentence_length(text)  # lexicon_count / number of sentences
    ASW = avg_syllables_per_word(text)  # syllables counted with Pyphen
    FRE = 206.835 - float(1.015 * ASL) - float(84.6 * ASW)
    return legacy_round(FRE, 2)
This feature, along with the count vectorizer features for the title and description, improved the model. The difficult part of working with the text was the language barrier (Russian): I was unable to verify the correctness of the readability index or to inspect the text for patterns that could yield more features.
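The helpers called inside flesch_reading_ease are not shown in the snippet. A self-contained sketch of the same formula, using only the standard library, could look like the following. Note the swap: instead of Pyphen, syllables are approximated by counting groups of consecutive vowel letters (Latin and Cyrillic), which is a crude heuristic. All names here are hypothetical, and the constants remain those calibrated for English.

```python
import re

# Vowel letters used by the syllable heuristic (Latin and Cyrillic).
VOWELS = set("aeiouy" "аеёиоуыэюя")


def count_syllables(word):
    """Approximate syllables as the number of vowel groups, at least one."""
    groups = 0
    prev_vowel = False
    for ch in word.lower():
        is_vowel = ch in VOWELS
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(groups, 1)


def flesch_reading_ease_sketch(text):
    """Flesch reading ease with punctuation-based sentence splitting."""
    # Sentences are split on runs of '.', '!' and '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of Unicode letters (digits and underscores excluded).
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        return None  # score is undefined for empty input

    asl = len(words) / len(sentences)  # average sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return round(206.835 - 1.015 * asl - 84.6 * asw, 2)
```

Higher scores indicate text that is easier to read, so short sentences made of short words score highest.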
Feature Selection
I used the BorutaPy package to select among all the previously extracted features (image and text features).
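BorutaPy implements the Boruta algorithm, whose core idea is to compare each real feature against shuffled "shadow" copies that cannot carry any signal. The sketch below shows a single round of that comparison using scikit-learn only. It is a simplified illustration, not BorutaPy itself, which repeats the comparison over many iterations and applies a statistical test. The function name shadow_feature_mask and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def shadow_feature_mask(X, y, n_estimators=200, random_state=0):
    """One round of a shadow-feature comparison (hypothetical helper).

    Returns a boolean mask marking features whose importance beats
    the best shuffled (shadow) feature.
    """
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]

    # Shadow features: every column shuffled independently, which breaks
    # any link to the target while keeping the column's distribution.
    shadows = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(n_features)]
    )

    forest = RandomForestRegressor(
        n_estimators=n_estimators, random_state=random_state
    )
    forest.fit(np.hstack([X, shadows]), y)

    importances = forest.feature_importances_
    real, shadow = importances[:n_features], importances[n_features:]
    return real > shadow.max()


# Synthetic example: only the first two of five features drive the target.
data_rng = np.random.default_rng(42)
X_demo = data_rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * data_rng.normal(size=300)
mask = shadow_feature_mask(X_demo, y_demo)
```

A single round like this can occasionally keep a noise feature by chance, which is why the full algorithm accumulates evidence over repeated runs before accepting or rejecting a feature.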
References
- The code for the features extracted from input images: Image Extraction for ad clicking