Recommender Systems, Machine Learning, Python

Contents

Recommendation System in Python
Recommender System Based on Similar-Image Search with ResNet50
Code and Dataset
Conclusions

Recommendation System in Python

There are many applications where websites collect data from their users and use it to predict what those users will like and dislike. This lets them recommend content the users are likely to enjoy. Recommender systems are a way of suggesting items and ideas that match a user's particular tastes.

Recommender systems come in different types:

Collaborative Filtering: Collaborative Filtering recommends items based on similarity measures between users and/or items. The basic assumption behind the algorithm is that users with similar interests have common preferences.
Content-Based Recommendation: A supervised machine learning approach that trains a classifier to discriminate between items that are interesting and uninteresting to the user.
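
As a rough illustration of the collaborative filtering idea, the sketch below predicts a missing rating as a similarity-weighted average of the ratings given by other users. The toy `ratings` dictionary, the user and item names, and the choice of cosine similarity over co-rated items are assumptions made for this example.

```python
import math

# Hypothetical toy data: user -> {item: rating}
ratings = {
    "alice": {"film_a": 5, "film_b": 4, "film_c": 1},
    "bob":   {"film_a": 4, "film_b": 5, "film_d": 5},
    "carol": {"film_a": 1, "film_c": 5, "film_d": 2},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict_rating(user, item):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += sim
    return num / den if den else None


# alice has not rated film_d; bob (similar taste) liked it, carol (different taste) did not
prediction = predict_rating("alice", "film_d")
print(round(prediction, 2))
```

Because alice's ratings are much closer to bob's than to carol's, the prediction lands nearer to bob's rating of film_d than to carol's.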

Content-Based Recommendation System: Content-based systems recommend items similar to those the customer has previously rated highly. They use the features and properties of each item, and from these properties they calculate the similarity between items.

In a content-based recommendation system, we first need to create a profile for each item that represents the properties of that item. From the item profiles, a user profile is inferred for each particular user. We then use these user profiles to recommend items from the catalog.
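
The flow from item profiles to a user profile to recommendations can be sketched as follows. This is a minimal example under stated assumptions: the item feature vectors, the ratings, and the choice of a rating-weighted average with cosine similarity are all hypothetical and only one of several reasonable ways to build a user profile.

```python
import math

# Hypothetical item profiles: item -> feature vector (action, comedy, drama)
item_profiles = {
    "film_a": [1.0, 0.0, 0.0],
    "film_b": [0.8, 0.2, 0.0],
    "film_c": [0.0, 1.0, 0.0],
    "film_d": [0.9, 0.0, 0.1],
    "film_e": [0.0, 0.1, 0.9],
}

# Hypothetical ratings given by one user
user_ratings = {"film_a": 5, "film_b": 4, "film_c": 1}


def build_user_profile(user_ratings, item_profiles):
    """Rating-weighted average of the profiles of the items the user rated."""
    dim = len(next(iter(item_profiles.values())))
    total = sum(user_ratings.values())
    profile = [0.0] * dim
    for item, rating in user_ratings.items():
        for k, value in enumerate(item_profiles[item]):
            profile[k] += rating * value / total
    return profile


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(user_ratings, item_profiles, top_n=2):
    """Rank the items the user has not rated by similarity to the user profile."""
    profile = build_user_profile(user_ratings, item_profiles)
    unseen = [i for i in item_profiles if i not in user_ratings]
    unseen.sort(key=lambda i: cosine(profile, item_profiles[i]), reverse=True)
    return unseen[:top_n]


user_profile = build_user_profile(user_ratings, item_profiles)
recommendations = recommend(user_ratings, item_profiles)
print(recommendations)
```

The user rated action-heavy items highly, so the action-heavy unseen item is ranked ahead of the drama.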

Content-Based Recommendation System

Item profile:

In a content-based recommendation system, we need to build a profile for each item that contains its important properties. For example, if the item is a movie, then its actors, director, release year, and genre are its important properties; for a document, the important properties are the type of content and the set of important words in it.

Let's look at how to create an item profile. First, we apply TF-IDF vectorization. Here the TF (term frequency) of a word is based on the number of times it appears in a document, and the IDF (inverse document frequency) of a word measures how significant that term is across the whole corpus. They can be calculated with the following formulas:

$$TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}}$$

where f_ij is the frequency of term (feature) i in document (item) j, and the denominator is the highest term frequency in document j.

$$IDF_i = \log_e \frac{N}{n_i}$$

where n_i is the number of documents that mention term i and N is the total number of documents.
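
A small sketch of these definitions in plain Python, assuming term frequency is normalized by the count of the most frequent term in the same document and IDF is the natural logarithm of N / n_i. The toy `documents` corpus and the `tf_idf` helper are made up for illustration; in practice a library vectorizer such as scikit-learn's `TfidfVectorizer` would normally be used, and its weighting differs in details such as smoothing and normalization.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is one item description
documents = [
    "red silk shirt",
    "blue silk dress",
    "red cotton shirt shirt",
]


def tf_idf(documents):
    """Return one {term: weight} item profile per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)  # N
    # n_i: number of documents that mention term i
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    profiles = []
    for tokens in tokenized:
        counts = Counter(tokens)          # f_ij
        max_count = max(counts.values())  # max_k f_kj
        profiles.append({
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return profiles


profiles = tf_idf(documents)
for doc, profile in zip(documents, profiles):
    print(doc, {term: round(weight, 3) for term, weight in profile.items()})
```

Terms that occur in only one document (such as "cotton") receive a higher IDF than terms shared by several documents (such as "silk").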

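In collaborative filtering, the rating of user x for item i can be predicted as a similarity-weighted average of the ratings r_yi given by the users y most similar to x:

$$r_{xi} = \frac{\sum_{y} S_{xy} r_{yi}}{\sum_{y} S_{xy}}, \quad S_{xy} = sim(x, y)$$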

Advantages and Disadvantages:

Advantages:
- No need for domain knowledge, because embeddings are learned automatically.
- Captures inherent subtle characteristics.

Disadvantages:
- Cannot handle fresh items due to the cold-start problem.
- Hard to add new features that may improve the quality of the model.

Recommender System Based on Similar-Image Search with ResNet50
Broadly, there are two approaches to building recommender systems: content-based filtering and collaborative filtering. The underlying assumption of collaborative filtering is that if A and B buy similar products, A is more likely to buy a product that B bought than a product bought by a random person. Unlike the content-based approach, it uses no features describing users or items; the recommender system relies on the user-item interaction matrix. A content-based system relies on knowledge about the items. For example, if a user is looking at silk T-shirts, they may be interested in seeing other silk T-shirts.
In this article I want to describe an approach based on searching for similar images. Why prepare additional data when almost all the key characteristics of some products, such as clothing, are visible in the image?

The idea is to extract features from product images with a convolutional network. In my example I used ResNet50, because the ResNet feature vector has a relatively low dimensionality. Extracting a feature vector with a pretrained network is very simple: remove the softmax classifier, which decides which class an image belongs to, and the network outputs a feature vector. Then we compare the vectors and look for similar ones. The more similar two images are, the smaller the Euclidean distance between their vectors.
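
The comparison step can be sketched with NumPy alone. The `features` array below is a hypothetical stand-in for image embeddings, and `nearest_neighbors` is an illustrative helper rather than part of the article's code; it simply ranks rows by Euclidean distance to the query vector.

```python
import numpy as np


def nearest_neighbors(query, features, k=3):
    """Return indices and distances of the k rows of `features` closest to `query`."""
    distances = np.linalg.norm(features - query, axis=1)
    order = np.argsort(distances)[:k]
    return order, distances[order]


# Hypothetical stand-in for image embeddings: 5 items with 4 features each
features = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 5.0, 5.0, 0.0],
    [1.2, 0.0, 0.1, 0.0],
], dtype=np.float32)

# Query with item 1; the closest hit is the item itself at distance 0
indices, dists = nearest_neighbors(features[1], features, k=3)
print(indices, dists)
```

When the query is itself part of the dataset, the first result is the query at distance zero, so a recommender would usually skip it.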

Code and Dataset

The dataset can be downloaded here: (dataset link).

Initializing the pretrained resnet50 from the PyTorch torchvision library and extracting features from the dataset:
```
import glob
import pickle

import numpy as np
import torch
from PIL import Image
from torchvision.io import read_image
from torchvision.models import resnet50, ResNet50_Weights
from tqdm import tqdm


def pil_loader(path):
    # Some images in the dataset are not in RGB format, so convert them to RGB
    with open(path, 'rb') as f:
        img = Image.open(f)
        return img.convert('RGB')


# Initialize the model pretrained on the ImageNet dataset
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.eval()
# Note: torchvision's resnet50 returns 1000 class logits (no softmax is applied).
# To use the 2048-dimensional pooled features instead, set model.fc = torch.nn.Identity()
preprocess = weights.transforms()

use_precomputed_embeddings = True
emb_filename = 'fashion_images_embs.pickle'

if use_precomputed_embeddings:
    with open(emb_filename, 'rb') as fIn:
        img_names, img_emb_tensors = pickle.load(fIn)
    print("Images:", len(img_names))
else:
    img_names = list(glob.glob('images/*.jpg'))
    img_emb = []
    # Extract features from the dataset images. On my CPU this took about an hour
    with torch.no_grad():  # gradients are not needed for inference
        for image in tqdm(img_names):
            img_emb.append(
                model(preprocess(pil_loader(image)).unsqueeze(0)).squeeze(0).numpy()
            )
    # Stack into one array first; building a tensor from a list of arrays is slow
    img_emb_tensors = torch.from_numpy(np.stack(img_emb))
    with open(emb_filename, 'wb') as handle:
        pickle.dump([img_names, img_emb_tensors], handle, protocol=pickle.HIGHEST_PROTOCOL)
```
A function that builds a search index with faiss and reduces the dimensionality of the feature vectors:
```
# faiss is used to compare the vectors
import faiss
import numpy as np
from sklearn.decomposition import PCA


def build_compressed_index(n_features):
    # Reduce the embeddings to n_features dimensions with PCA
    pca = PCA(n_components=n_features)
    compressed_features = pca.fit_transform(np.asarray(img_emb_tensors))
    # faiss expects a contiguous float32 array
    xb = np.ascontiguousarray(compressed_features, dtype=np.float32)
    d = xb.shape[1]
    # Exact (brute-force) index over Euclidean (L2) distance
    index_compressed = faiss.IndexFlatL2(d)
    index_compressed.add(xb)
    return [pca, index_compressed]
```
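One way to choose a value for `n_features` is to check how much variance a given number of components retains. The sketch below does this with a NumPy SVD on synthetic data; the `pca_fit` and `pca_transform` helpers and the random `embeddings` are assumptions for illustration only. With scikit-learn's `PCA`, the same information is available through its `explained_variance_ratio_` attribute.

```python
import numpy as np


def pca_fit(data, n_components):
    """Fit PCA via SVD; return the mean, the components and the retained variance share."""
    mean = data.mean(axis=0)
    centered = data - mean
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    retained = variance[:n_components].sum() / variance.sum()
    return mean, vt[:n_components], retained


def pca_transform(data, mean, components):
    """Project data onto the fitted principal components."""
    return (data - mean) @ components.T


rng = np.random.default_rng(0)
# Hypothetical embeddings: 200 vectors of dimension 50 lying near a 5-dimensional subspace
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
embeddings = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

mean, components, retained = pca_fit(embeddings, n_components=5)
compressed = pca_transform(embeddings, mean, components)
print(compressed.shape, round(float(retained), 4))
```

If the retained share is already close to 1 for a small number of components, a larger `n_features` mostly adds index size and search time without improving the neighbors much.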
Helpers for displaying the results:
```
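import matplotlib.pyplot as plt
import matplotlib.image as mpimg


def main_image(img_path, desc):
    # Show the query image
    plt.imshow(mpimg.imread(img_path))
    plt.xlabel(img_path.split('.')[0] + '_Original Image', fontsize=12)
    plt.title(desc, fontsize=20)
    plt.show()


def similar_images(indices, suptitle):
    plt.figure(figsize=(15, 10), facecolor='white')
    plotnumber = 1
    for index in indices[0:4]:
        # The original listing breaks off after "if plotnumber"; the rest of this
        # function is a reconstruction and may differ from the author's code
        if plotnumber <= len(indices):
            plt.subplot(2, 2, plotnumber)
            plt.imshow(mpimg.imread(img_names[index]))
            plt.xlabel(img_names[index], fontsize=12)
            plotnumber += 1
    plt.suptitle(suptitle, fontsize=20)
    plt.tight_layout()
    plt.show()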
```
The search function itself. It takes the number of features as an argument so that you can experiment to find how many features are enough:
```
import numpy as np


# Search: query by the index of an image whose features were already extracted, or pass a new image
def search(query, factors):
    if isinstance(query, str):
        img_path = query
    else:
        img_path = img_names[query]
    # Load the query the same way as the dataset images (forced to RGB)
    with torch.no_grad():
        one_img_emb = model(preprocess(pil_loader(img_path)).unsqueeze(0)).squeeze(0).numpy()
    main_image(img_path, 'Query')
    compressor, index_compressed = build_compressed_index(factors)
    D, I = index_compressed.search(np.float32(compressor.transform([one_img_emb])), 5)
    # The first hit is skipped: for an image from the dataset it is the query itself
    similar_images(I[0][1:], "faiss compressed " + str(factors))
```
The star of the show. Calling the search:
```
search(100, 300)
search("t-shirt.jpg", 500)
```

Conclusions

As a result, in a couple of hours you can put together a fairly good recommender system based on image similarity, which is enough for some use cases. The images need no preprocessing, labeling, or metadata, which greatly simplifies the process. To improve recommendation quality, you can fine-tune some of the network's layers on your own dataset.