Author: Zizhun Guo
The dataset we use was downloaded from Yelp. It was first released for the Yelp Dataset Challenge and has been updated regularly over the years. It contains information about reviews, businesses, users, and business check-ins. We mainly use two of its files: business and reviews. We focus on 6000 businesses divided evenly among 3 categories; for each business, all of its reviews are concatenated into a single document, and these documents serve as the features. The vocabulary size is around 180,000, which makes the documents rich enough in information to represent each business. The businesses we select are restaurants tagged as Sushi Bars, American (New), or Fast Food. We chose restaurants because the Yelp dataset focuses mainly on restaurant services, so the review text is plentiful, and we chose these three categories because they differ in both business model and the food service they provide.
For each document, we use only 6000 characters (whitespace included) for training because of memory limits during deep learning training, and we use the same truncated dataset for every model so that the comparison is fair. We conduct 10-fold cross-validation. We assume that the reviews for each restaurant, as extracted from the JSON dataset file, are in random order. We rank the restaurants by their number of reviews, so that when the reviews are concatenated the most-reviewed restaurants are selected as the source restaurants.
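The document construction and the fold split described above can be sketched in plain Python. This is a minimal illustration, not the code used for the experiments: the helper names `build_documents` and `k_fold_indices` are hypothetical, keeping the *first* 6000 characters is an assumption (the text only gives the budget), and the shuffled, non-stratified split is likewise an assumption.

```python
import random
from collections import defaultdict

MAX_CHARS = 6000  # per-document character budget stated in the text


def build_documents(reviews, n_businesses, max_chars=MAX_CHARS):
    """Return {business_id: document} for the n most-reviewed businesses.

    `reviews` is an iterable of dicts with "business_id" and "text" keys.
    """
    texts = defaultdict(list)
    for review in reviews:
        texts[review["business_id"]].append(review["text"])
    # Rank businesses by review count, most-reviewed first.
    # sorted() is stable, so ties keep their first-seen order.
    ranked = sorted(texts, key=lambda b: len(texts[b]), reverse=True)
    # Concatenate reviews in file order, then truncate (assumption: keep the head).
    return {b: " ".join(texts[b])[:max_chars] for b in ranked[:n_businesses]}


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

With 6000 documents and `k=10`, each fold would hold out 600 documents and train on the remaining 5400.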
This directory contains the actual dataset used for the experiments.
We use Yelp Dataset Business (Business) and Yelp Dataset Reviews (Reviews) to build the training and testing samples. Reviews is the main dataset. Business is joined to it so that the data can be balanced across categories. Each sample therefore attaches one review text. Because the goal is to predict the category of a restaurant, the samples drawn from Reviews must come from unique businesses, and including more restaurants is preferable. Link: Yelp Dataset Schema
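The join between Business and Reviews might look like the sketch below. It assumes the Yelp files are JSON-lines (one JSON object per line) with `business_id` and `text` fields; the helper names and the `category_by_business` mapping are hypothetical and only illustrate the idea of attaching a category to each review.

```python
import json


def read_json_lines(path):
    """Yield one dict per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def attach_category(reviews, category_by_business):
    """Keep reviews whose business is in the mapping and attach its category.

    `category_by_business` maps business_id -> category label (assumed to be
    built beforehand from the Business file).
    """
    for review in reviews:
        category = category_by_business.get(review["business_id"])
        if category is not None:
            yield {
                "business_id": review["business_id"],
                "text": review["text"],
                "category": category,
            }
```

Streaming the files line by line keeps memory use low, which matters because the Reviews file is far larger than the Business file.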
Plotting histograms of both Business and Reviews helps to identify how the restaurants are distributed across categories. As the Business histogram shows, if we require unique restaurants when filtering Reviews, we can select at most 2363 restaurants for each of the three categories (‘Fast Food’, ‘American (New)’, ‘Sushi Bars’), since we want the samples to be balanced. The histogram also shows that some restaurants are tagged with more than one of these categories, but they are too few to consider.
The Reviews histogram shows that reviews far outnumber businesses (one restaurant can have multiple reviews), so without the uniqueness condition on restaurants we could have as many as 268026 samples. In our case, we include 2000 unique restaurants from each category for pre-processing, a size that is easy to handle (6000 samples in total).
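A balanced selection of this kind could be sketched as follows. The `categories` (comma-separated string) and `review_count` fields are assumptions about the Business schema, `select_balanced` is a hypothetical helper, and skipping restaurants tagged with more than one target category is one possible reading of "too few to consider".

```python
TARGETS = ("Fast Food", "American (New)", "Sushi Bars")


def select_balanced(businesses, per_category=2000, targets=TARGETS):
    """Pick the `per_category` most-reviewed businesses for each target category.

    Returns a list of (business_id, category) pairs. Businesses tagged with
    more than one target category are skipped (assumption).
    """
    buckets = {t: [] for t in targets}
    for b in businesses:
        # `categories` may be missing or None for some businesses.
        tags = {c.strip() for c in (b.get("categories") or "").split(",")}
        matched = [t for t in targets if t in tags]
        if len(matched) == 1:
            buckets[matched[0]].append(b)
    selected = []
    for t in targets:
        ranked = sorted(buckets[t], key=lambda b: b["review_count"], reverse=True)
        selected += [(b["business_id"], t) for b in ranked[:per_category]]
    return selected
```

Using the same `per_category` cap for every category is what keeps the classes balanced; the cap cannot exceed the size of the smallest category.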
hours text category
0 {'Tuesday': '11:0-22:0', 'Wednesday': '11:0-22... I love Deagan's. I do. I... American (New)
1 {'Monday': '9:0-0:0', 'Tuesday': '9:0-0:0', 'W... Oh happy day, finally ha... Fast Food
2 {'Monday': '10:30-21:0', 'Tuesday': '10:30-21:... This is definitely my fa... Fast Food
3 {'Tuesday': '10:0-17:0', 'Wednesday': '10:0-17... I have been here twice. ... American (New)
4 {'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'... Everything that my husba... Fast Food
5 {'Monday': '11:0-20:0', 'Tuesday': '11:0-21:0'... My husband and I go ther... American (New)
6 {'Monday': '17:0-23:0', 'Tuesday': '17:0-23:0'... As the previous person p... American (New)
7 {'Monday': '0:0-0:0', 'Tuesday': '17:0-22:0', ... This was our choice, by ... Sushi Bars
8 {'Monday': '7:0-22:0', 'Tuesday': '7:0-22:0', ... The employees are so fri... Fast Food
9 {'Monday': '6:0-6:0', 'Tuesday': '6:0-6:0', 'W... This actually used to be... Sushi Bars
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hours 6000 non-null object
1 text 6000 non-null object
2 category 6000 non-null object
dtypes: float64(1), object(3)
The original labels are categorical: Sushi Bars, American (New), and Fast Food. They have to be transformed into numerical labels for the models to learn from. We produce two types of labels (multiclass and multilabel) for our learning models to output.
Category before re-label | ‘Fast Food’ | ‘American (New)’ | ‘Sushi Bars’ |
---|---|---|---|
Category after re-label | 1 | 0 | -1 |
index | category | Multiclass | Multilabel(One-hot Encoded) |
---|---|---|---|
0 | Sushi Bars | -1 | 1 0 0 |
1 | Fast Food | 1 | 0 0 1 |
2 | Fast Food | 1 | 0 0 1 |
3 | American (New) | 0 | 0 1 0 |
4 | Fast Food | 1 | 0 0 1 |
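The two encodings in the tables above can be reproduced with a short sketch. The mapping values come from the re-label table; the function names are hypothetical, and the one-hot column order (ascending numeric label: -1, 0, 1) is inferred from the example rows.

```python
LABEL_MAP = {"Fast Food": 1, "American (New)": 0, "Sushi Bars": -1}


def to_multiclass(category):
    """Map a category name to its numeric multiclass label."""
    return LABEL_MAP[category]


def to_one_hot(category):
    """Map a category name to a one-hot vector ordered by numeric label."""
    order = sorted(LABEL_MAP.values())  # [-1, 0, 1]
    label = LABEL_MAP[category]
    return [1 if label == value else 0 for value in order]


categories = ["Sushi Bars", "Fast Food", "Fast Food", "American (New)", "Fast Food"]
multiclass = [to_multiclass(c) for c in categories]
one_hot = [to_one_hot(c) for c in categories]
```

An unknown category name raises `KeyError`, which surfaces labeling mistakes early instead of silently producing an all-zero vector.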