Machine Learning with Laravel and AWS Personalize Part 1

AWS Personalize lets developers build applications with the same machine learning technology Amazon uses for real-time personalized recommendations, with no machine learning experience required.

How does AWS Personalize Work?

AWS Personalize Use Cases

Some of the use cases that AWS Personalize can help with are:

User Impressions

Items that are seen but not clicked can still be modeled to drive relevant recommendations.

Item Exploration

You can balance exploring new content against recommending content a user is likely to find relevant.

Exploration means items with less interaction data are recommended more frequently; exploitation means recommendations are based on what is already known about each user's interests.

Future recommendations are then adjusted based on implicit user feedback.

This allows you to balance business priorities against user preferences, depending on whether your audience likes to search and explore or prefers content served to them.

Filtering for events

Exclude or include items to recommend based on event criteria.

Filtering based on Metadata

This lets you stop items that have already been purchased from being recommended, limit items by specific fields such as category or genre, or apply parental controls.

Cold Start

This allows you to make recommendations for new users or items without any interaction history.

Understanding your Data with AWS Personalize

In order to get the best performance from AWS Personalize you'll need a few things in your data.

A large collection of KNOWN users/customers
Lengthy user histories with items (10+ interaction points per user)
Items that are immutable (the same items over time)
A large collection of items
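One way to sanity-check the "10+ interaction points per user" guideline before uploading anything is to count rows per user. The following is a minimal sketch in plain PHP, assuming rows shaped like the MovieLens ratings.csv (userId, movieId, rating, timestamp). The function names are hypothetical and are not part of the project source.

```php
<?php

declare(strict_types=1);

/**
 * Count interactions per user.
 * Each row is assumed to be [userId, movieId, rating, timestamp].
 *
 * @return array<int|string, int> interaction count keyed by user id
 */
function countInteractionsPerUser(iterable $rows): array
{
    $counts = [];

    foreach ($rows as $row) {
        $userId = (string) $row[0];
        $counts[$userId] = ($counts[$userId] ?? 0) + 1;
    }

    return $counts;
}

/**
 * Return the ids of users with fewer than $minimum interactions.
 */
function usersBelowThreshold(array $counts, int $minimum = 10): array
{
    return array_keys(array_filter($counts, fn (int $count) => $count < $minimum));
}

// Tiny in-memory sample; in practice the rows would be streamed from the CSV
$rows = [
    ['1', '296', '5.0', '1147880044'],
    ['1', '306', '3.5', '1147868817'],
    ['2', '296', '4.0', '1141415790'],
];

$counts = countInteractionsPerUser($rows);

// A threshold of 2 is used here only so the small sample flags a user
$sparseUsers = usersBelowThreshold($counts, 2);
```

Because the function accepts any iterable, the rows could also come from a generator that reads the file line by line, so the full CSV never has to sit in memory.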

In order to achieve this we'll use some machine learning data provided by the MovieLens Dataset.

Once you've downloaded the MovieLens 25M Dataset, extract the zip and place it inside your storage/samples directory.

Once extracted, ratings.csv is around 650 MB and contains roughly 25 million user-item interaction points (ratings).

Remember to add the samples disk to your config/filesystems.php.

// config/filesystems.php

'disks' => [
    // ...

    'samples' => [
        'driver' => 'local',
        'root' => storage_path('samples'),
        'throw' => false,
    ],

    // ...
],

I've also added some custom files to help make seeding the data faster, which you can find here.

Your final directory structure will look like this.

ls -lh storage/samples

total 3634080
-rw-rw-r--@ 1 rob  staff   415M 22 Nov  2019 genome-scores.csv
-rw-rw-r--@ 1 rob  staff    18K 22 Nov  2019 genome-tags.csv
-rw-r--r--  1 rob  staff   1.0K 13 Mar 21:47 genres.csv
-rw-r--r--  1 rob  staff   5.9M 13 Mar 21:49 movie_genres.csv
-rw-rw-r--@ 1 rob  staff   2.8M 14 Mar 10:30 movies.csv
-rw-rw-r--@ 1 rob  staff   647M 22 Nov  2019 ratings.csv
-rw-r--r--  1 rob  staff   647M 14 Mar 10:11 ratings_without_heading.csv
-rw-rw-r--@ 1 rob  staff    37M 22 Nov  2019 tags.csv
-rw-r--r--  1 rob  staff   718K 14 Mar 10:41 users.csv

Source Code

If you want to follow along, you can find the source code here. Laravel with AWS Personalize

Install Dependencies

Make sure you install league/csv by running composer require league/csv:^9.0. We will need to process CSVs later, and this package makes that a much nicer experience.

Models and Database Schema

Our Models and Schemas will look like this.

// database/migrations/2022_03_13_092433_create_movies_table.php

Schema::create('movies', function (Blueprint $table) {
    $table->id();

    $table->string('name')->index();

    $table->timestamps();
});

// database/migrations/2022_03_13_092512_create_genres_table.php

Schema::create('genres', function (Blueprint $table) {
    $table->id();

    $table->string('name')->unique();

    $table->timestamps();
});

Schema::create('movie_genres', function (Blueprint $table) {
    $table->id();

    $table->foreignId('movie_id')
        ->constrained('movies')
        ->onDelete('cascade');

    $table->foreignId('genre_id')
        ->constrained('genres')
        ->onDelete('cascade');

    $table->timestamps();
});

// database/migrations/2022_03_14_080111_create_ratings_table.php

Schema::create('ratings', function (Blueprint $table) {
    $table->id();

    $table->foreignId('user_id')
        ->constrained('users')
        ->onDelete('cascade');

    $table->foreignId('movie_id')
        ->constrained('movies')
        ->onDelete('cascade');

    $table->float('rating');

    $table->timestamps();
});
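The migrations above only cover the schema. The export job later in this post calls Movie::with('genres'), so the Movie model needs a genres relationship. The following is a minimal sketch of what that model could look like, assuming a Genre model exists in App\Models. The real model in the project source may differ.

```php
<?php

namespace App\Models;

use Illuminate\Database\Eloquent\Model;
use Illuminate\Database\Eloquent\Relations\BelongsToMany;

// Assumed shape of the Movie model; not copied from the project source
class Movie extends Model
{
    // Many-to-many link to genres through the movie_genres pivot table
    public function genres(): BelongsToMany
    {
        // The pivot table name is passed explicitly because Laravel would
        // otherwise look for a table named genre_movie by convention.
        return $this->belongsToMany(Genre::class, 'movie_genres')
            ->withTimestamps();
    }
}
```

The withTimestamps() call is optional. It is included here because the movie_genres migration defines created_at and updated_at columns.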

Database Seeders

We'll import the movies & genres from the MovieLens dataset.

It's much easier to work with datasets that have pretty names compared to the typical Lorem ipsum generators.

Let's create a trait so we can quickly read and insert the data.

<?php

namespace App\Traits;

use Illuminate\Support\Facades\Storage;
use Illuminate\Support\LazyCollection;

trait ReadsSamples
{
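    protected function readSampleAsLazyCollection(string $filename): LazyCollection
    {
        return LazyCollection::make(function () use ($filename) {
            $handle = fopen(Storage::disk('samples')->path($filename), 'r');

            try {
                // Yield one raw line at a time so the whole file is never held in memory
                while (($line = fgets($handle)) !== false) {
                    yield $line;
                }
            } finally {
                // Release the file handle once the generator finishes or is discarded
                fclose($handle);
            }
        })->map(fn ($line) => str_getcsv($line));
    }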
}

Then we'll create the following Seeders.

<?php

namespace Database\Seeders;

use App\Models\Genre;
use App\Models\Movie;
use App\Traits\ReadsSamples;
use Illuminate\Database\Seeder;
use Illuminate\Support\Facades\DB;

class MovieSeeder extends Seeder
{
    use ReadsSamples;

    public function run()
    {
        $this->createMovies();

        $this->createGenres();

        $this->createMovieGenres();
    }

    protected function createMovies()
    {
        $this->getMovieSamples()
            ->chunk(1000)
            ->each(function ($movies) {
                Movie::upsert([
                    ...$movies->map(function ($item) {
                        return collect($item)->except(['genres'])->toArray();
                    })
                ], ["id"], ["name"]);
            });
    }

    protected function createGenres()
    {
        $this->getGenreSamples()
            ->chunk(100)
            ->each(function ($genres) {
                Genre::upsert([
                    ...$genres->map(function ($item) {
                        return collect($item)->toArray();
                    })
                ], ["id"], ["name"]);
            });
    }

    protected function createMovieGenres()
    {
        $this->getMovieGenreSamples()
            ->chunk(1000)
            ->each(function ($genres) {
                DB::table('movie_genres')->upsert([
                    ...$genres->map(function ($item) {
                        return collect($item)->toArray();
                    })
                ], ["id"], ["movie_id", "genre_id", "created_at", "updated_at"]);
            });
    }

    protected function getMovieSamples()
    {
        return $this->readSampleAsLazyCollection('movies.csv')
            ->map(function ($movie) {
                return [
                    'id' => $movie[0],
                    'name' => $movie[1],
                    'genres' => str($movie[2])->explode('|')
                ];
            });
    }

    protected function getGenreSamples()
    {
        return $this->readSampleAsLazyCollection('genres.csv')
            ->map(function ($genre) {
                return [
                    'id' => $genre[0],
                    'name' => $genre[1],
                ];
            });
    }

    protected function getMovieGenreSamples()
    {
        return $this->readSampleAsLazyCollection('movie_genres.csv')
            ->map(function ($row) {
                return [
                    'id' => $row[0],
                    'movie_id' => $row[1],
                    'genre_id' => $row[2],
                    'created_at' => $row[3],
                    'updated_at' => $row[4],
                ];
            });
    }
}

<?php

namespace Database\Seeders;

use App\Models\User;
use App\Traits\ReadsSamples;
use Illuminate\Database\Seeder;

class UserSeeder extends Seeder
{
    use ReadsSamples;

    public function run()
    {
        $this->createUsers();
    }

    protected function createUsers()
    {
        $this->getUserSamples()
            ->chunk(1000)
            ->each(function ($users) {
                User::upsert([
                    ...$users->toArray()
                ], ["id"], ["name", "email", "email_verified_at", "password", "remember_token"]);
            });
    }

    protected function getUserSamples()
    {
        return $this->readSampleAsLazyCollection('users.csv')
            ->map(function ($row) {
                return array_merge(
                    User::factory(["id" => $row[0], "email" => "{$row[0]}@test.com"])->make()->toArray(),
                    [
                        'password' => '$2y$10$92IXUNpkjO0rOQ5byMi.Ye4oKoEa3Ro9llC/.og/at2.uheWG/igi',
                    ]
                );
            });
    }
}

<?php

namespace Database\Seeders;

use App\Models\Rating;
use App\Traits\ReadsSamples;
use Illuminate\Database\Seeder;

class RatingSeeder extends Seeder
{
    use ReadsSamples;

    public function run()
    {
        $this->createRatings();
    }

    protected function createRatings()
    {
        $this->getRatingSamples()
            ->chunk(1000)
            ->each(function ($ratings) {
                try {
                    Rating::upsert([
                        ...$ratings->toArray()
                    ], ["id"], ["user_id", "movie_id", "rating"]);
                } catch (\Throwable $t) {
                    // Skip any chunk that fails to insert so one bad chunk does not abort the whole seed
                }
            });
    }

    protected function getRatingSamples()
    {
        return $this->readSampleAsLazyCollection('ratings_without_heading.csv')
            // NOTE: this drops the first row; remove skip(1) if the file really has no header line
            ->skip(1)
            ->map(function ($row) {
                return [
                    'user_id' => $row[0],
                    'movie_id' => $row[1],
                    'rating' => $row[2]
                ];
            });
    }
}

And finally we have our DatabaseSeeder class.

public function run()
{
    $this->call([
        MovieSeeder::class,
        UserSeeder::class,
        RatingSeeder::class
    ]);
}

In your terminal run the following command:

$ sail artisan migrate:fresh --seed

You will see the following output.

Dropped all tables successfully.

Migration table created successfully.
Migrating: 2014_10_12_000000_create_users_table
Migrated:  2014_10_12_000000_create_users_table (30.89ms)
Migrating: 2014_10_12_100000_create_password_resets_table
Migrated:  2014_10_12_100000_create_password_resets_table (4.17ms)
Migrating: 2019_08_19_000000_create_failed_jobs_table
Migrated:  2019_08_19_000000_create_failed_jobs_table (8.01ms)
Migrating: 2019_12_14_000001_create_personal_access_tokens_table
Migrated:  2019_12_14_000001_create_personal_access_tokens_table (8.95ms)
Migrating: 2022_03_13_092433_create_movies_table
Migrated:  2022_03_13_092433_create_movies_table (6.93ms)
Migrating: 2022_03_13_092512_create_genres_table
Migrated:  2022_03_13_092512_create_genres_table (15.81ms)
Migrating: 2022_03_14_080111_create_ratings_table
Migrated:  2022_03_14_080111_create_ratings_table (3.91ms)
Seeding: Database\Seeders\MovieSeeder
Seeded:  Database\Seeders\MovieSeeder (3,924.79ms)
Seeding: Database\Seeders\UserSeeder
Seeded:  Database\Seeders\UserSeeder (22,753.31ms)
Seeding: Database\Seeders\RatingSeeder
Seeded:  Database\Seeders\RatingSeeder (571,578.33ms)

Database seeding completed successfully.

The final stats look like this.

| Model        | Count    |
| ------------ | -------- |
| Movies       | 62422    |
| Genres       | 20       |
| Movie Genres | 112458   |
| Users        | 120934   |
| Ratings      | 18663000 |

Exporting Interactions and Items

Now we'll create two artisan commands to assist with the exporting of the data.

// app/Console/Commands/ExportInteractionsCommand.php

<?php

namespace App\Console\Commands;

use App\Strategies\ExportInteractionsJob;
use Illuminate\Console\Command;

class ExportInteractionsCommand extends Command
{
    protected $signature = 'export:interactions';

    protected $description = 'Export all the positive interactions';

    public function handle()
    {
        ExportInteractionsJob::dispatchSync();
    }
}

// app/Console/Commands/ExportItemsCommand.php

<?php

namespace App\Console\Commands;

use App\Strategies\ExportItemsJob;
use Illuminate\Console\Command;

class ExportItemsCommand extends Command
{
    protected $signature = 'export:items';

    protected $description = 'Export all the items';

    public function handle()
    {
        ExportItemsJob::dispatchSync();
    }
}

Normally these jobs would be pushed to the queue, but for simplicity we'll run them inline for now.

Now we'll create two jobs to export all the data.

<?php

namespace App\Strategies;

use App\Models\Rating;
use Illuminate\Database\Eloquent\Collection;
use League\Csv\Writer;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Storage;

class ExportInteractionsJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function handle()
    {
        $writer = $this->createExportFile();

        $writer->insertOne($this->personalizeSchema());

        // Only ratings of 3 stars and above are exported as positive interactions
        Rating::where('rating', '>=', 3)
            ->chunkById(1000, function ($ratings) use ($writer) {
                $this->createInteractionsExport($writer, $ratings);
            }, 'id');
    }

    private function createExportFile(string $filename = 'interactions-dataset.csv')
    {
        return Writer::createFromPath(
            Storage::disk('samples')->path($filename),
            'w+'
        );
    }

    private function createInteractionsExport(Writer $writer, Collection $ratings)
    {
        $records = $ratings
            ->map(function ($rating) {
                // Column order must match personalizeSchema()
                return [
                    $rating->user_id,
                    $rating->movie_id,
                    $rating->created_at->timestamp,
                    "rating",
                ];
            });

        $writer->insertAll($records);
    }

    private function personalizeSchema()
    {
        return [
            'USER_ID',
            'ITEM_ID',
            'TIMESTAMP',
            'EVENT_TYPE'
        ];
    }
}

AWS Personalize does not understand negative feedback, so with this dataset it helps to drop any interaction rated below 3 stars. That is what the where('rating', '>=', 3) clause in the job above does.
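The same rule can be expressed as a small pure function, which is handy for testing the cut-off without touching the database. This is a minimal sketch in plain PHP, assuming rows shaped like the original MovieLens ratings.csv (userId, movieId, rating, timestamp). The function name is hypothetical. Unlike the job above, it uses the dataset's own timestamp column, which the seeders in this post do not import.

```php
<?php

declare(strict_types=1);

/**
 * Map one ratings.csv row to a Personalize interaction row
 * (USER_ID, ITEM_ID, TIMESTAMP, EVENT_TYPE), or return null
 * when the rating is below the cut-off.
 */
function toInteractionRow(array $row, float $minRating = 3.0): ?array
{
    [$userId, $movieId, $rating, $timestamp] = $row;

    if ((float) $rating < $minRating) {
        return null;
    }

    return [$userId, $movieId, (int) $timestamp, 'rating'];
}

$rows = [
    ['1', '296', '5.0', '1147880044'],
    ['1', '306', '2.5', '1147868817'],
    ['2', '296', '3.0', '1141415790'],
];

// array_filter drops the null entries, array_values reindexes the result
$interactions = array_values(array_filter(array_map('toInteractionRow', $rows)));
```

With the sample rows, the 2.5-star rating is dropped and the other two rows are kept.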

<?php

namespace App\Strategies;

use App\Models\Movie;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Database\Eloquent\Collection;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Storage;
use League\Csv\Writer;

class ExportItemsJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function handle()
    {
        $writer = $this->createExportFile();

        $writer->insertOne($this->personalizeSchema());

        // Eager load genres so each chunk needs only one extra query
        Movie::with('genres')
            ->chunkById(1000, function ($items) use ($writer) {
                $this->createItemsExport($writer, $items);
            }, 'id');
    }

    private function createExportFile(string $filename = 'items-dataset.csv')
    {
        return Writer::createFromPath(
            Storage::disk('samples')->path($filename),
            'w+'
        );
    }

    private function createItemsExport(Writer $writer, Collection $items)
    {
        $records = $items
            ->map(function ($movie) {
                // Multiple genres are joined with a pipe into a single GENRE field
                return [
                    $movie->id,
                    implode(
                        '|',
                        $movie->genres->pluck('name')->toArray()
                    )
                ];
            });

        $writer->insertAll($records);
    }

    private function personalizeSchema()
    {
        return [
            'ITEM_ID',
            'GENRE',
        ];
    }
}

Normally these export jobs would write to your S3 bucket directly. For now we're exporting the files to storage/samples/interactions-dataset.csv and storage/samples/items-dataset.csv.

You can run both artisan commands with:

sail artisan export:items
sail artisan export:interactions

Note: This will take some time to run.

This concludes Part 1 of the series.

In Part 2, we'll set up AWS Personalize with the datasets we've created.
