Extracting Canadian Nutrition Data

Building on my work in the last post, I am continuing my mission to build a reliable database for retreiving nutrition information for commonly used ingredients. I will use this data in order to add a calculator to estimate nutrition facts for recipes in my recipe management app Root Cellar (Github).

The ComprehensiveFoodDatabase I cleaned up in the last post is derived from the USDA FoodData dataset. While it has a huge variety of branded foods, and a decent number of raw ingredients (albeit with odd names and unintuitive units) I was having a lot of trouble getting good matches for even the common ingredients in my recipes.

I really needed a data source for common raw ingredients used in cooking. I came across the Canadian Nutrient File dataset, so I decided to give that a try next.

Exploring the Canadian Nutrient Files

I started by downloading the files as CSV, and then used DB Browser for SQLite to import the CSV files into snake_case named tables. You can find the resulting SQLite DB here.

 CREATE VIEW canfood AS
SELECT 
	FoodID, description, serving_size,
	avg(NutrientValue) filter(where NutrientName = "ENERGY (KILOCALORIES)") calories,
	avg(NutrientValue) filter(where NutrientName = "FAT (TOTAL LIPIDS)") fat,
	avg(NutrientValue) filter(where NutrientName = "CARBOHYDRATE, TOTAL (BY DIFFERENCE)") carbs,
	avg(NutrientValue) filter(where NutrientName = "PROTEIN") protein,
	avg(NutrientValue) filter(where NutrientName = "FIBRE, TOTAL DIETARY") fiber,
	avg(NutrientValue) filter(where NutrientName = "SUGARS, TOTAL") sugar,
	avg(NutrientValue) filter(where NutrientName = "SODIUM") sodium
FROM
	(SELECT
		food_name.FoodID, food_name.FoodDescription as description, serving_size, 
		nutrient_amount.NutrientName, nutrient_amount.NutrientValue
	FROM (
		SELECT * FROM food_name LEFT JOIN (
			SELECT conversion_factor.FoodID, measure_name.MeasureDescription as serving_size FROM conversion_factor LEFT JOIN measure_name ON conversion_factor.MeasureID = measure_name.MeasureID
		) as serving ON food_name.FoodID = serving.FoodID
	) as food_name
	INNER JOIN (
		SELECT nutrient_amount.FoodID, nutrient_amount.NutrientValue, nutrient_name.NutrientName FROM nutrient_amount INNER JOIN nutrient_name ON nutrient_amount.NutrientID = nutrient_name.NutrientID
	) as nutrient_amount 
	ON food_name.FoodID = nutrient_amount.FoodID) as nutr
GROUP BY FoodID
Canada Nutrient File DB structure.

I acheive my final, clean table called canfood with a series of joins:
1) Food Name <= Nutrient Amount <= Nutrient Name
2) Joined table from step 1 <= Conversion Factor <= Measure name (Serving size)
Then by grouping on food ID, I was able to aggregate the various macronutrient types into their own columns (Note: I only aggregate for a limited number of nutrition values, but many more are available in the database, including for many micronutrients).

This data is great!

Once the data was all assembled in the canfood table, I found that the nutrtion data contained here was a lot more comprehensive for raw ingredients as compared to the dataset from the USDA I explored in my last post. It’s much more suitable for my needs in my recipe manager project.

It was definitely worth the time to migrate the data over! After some more SQL magic, I incorporated the canfood table data into the food_search table in the database I generated in my last post to finalize my comprehensive nutrition dataset. I’ve renamed some of the fields and eliminated some superfluous columns to yield a clean final product. You can access the final full-text search nutrition SQLite DB here.

CREATE VIRTUAL TABLE food_search USING fts5(
    fdc_id,
    description, brand_name, brand_owner,
    serving_size, serving_size_unit,
    calories, protein, carbs, fat, fiber, sodium, sugar
);

INSERT INTO food_search (
    fdc_id, 
    description, brand_name, brand_owner,
    serving_size, serving_size_unit,
    calories, protein, carbs, fat, fiber, sodium, sugar
) SELECT
    fdc_id, 
    description, brand_name, brand_owner,
    serving_size, serving_size_unit,
    energy_amount as calories, protein_amount as protein, carb_amount as carbs, fat_amount as fat, fiber_amount as fiber, sodiumna_amount as sodium, sugarstotalincludingnlea_amount as sugar
FROM usda_branded_column;

INSERT INTO food_search (
    fdc_id, 
    description,
    serving_size, serving_size_unit,
    calories, protein, carbs, fat, fiber
) SELECT
    fdc_id, 
    description,
    serving_amount as serving_size, serving_text as serving_size_unit,
    energy_amount as calories, protein_amount as protein, carb_amount as carbs, fat_amount as fat, fiber_amount as fiber
FROM usda_non_branded_column;

INSERT INTO food_search (
    fdc_id, 
    description,
    serving_size_unit,
    calories, protein, carbs, fat, fiber, sodium, sugar
) SELECT
    FoodID as fdc_id, 
    description,
    serving_size as serving_size_unit,
    calories, protein, carbs, fat, fiber, sodium, sugar
FROM canfood;

Building a Nutrition Calculator

I’m excited to report that using this data, I’ve got a functional nutrition calculator up and running on Root Cellar (Github). I’m now having much more success matching ingredients in my recipe app as I have much better coverage of raw ingredients using this database. I have not included the menustat table in my searchable database, as I do not need data on fast-food nutrition but the process is the same as shown above if you wish to incorporate it for your own uses.

Beta Root Cellar Nutrition Calculator

I hope this information/database can help somebody. Thanks again to the USDA, Health Canada, and ComprehensiveFoodDatabase teams for their work on this dataset.

References:
– Whalen, Lexington, Brie Turner-McGrievy, Matthew McGrievy, Andrew Hester, and Homayoun Valafar. “On Creating a Comprehensive Food Database.” In 2022 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 1610-1614. IEEE, 2022.
– U.S. Department of Agriculture, Agricultural Research Service. FoodData Central, 2019. fdc.nal.usda.gov.
Canadian Nutrient File, Health Canada, 2015

Extracting Nutrition Data

There’s a relative paucity of good nutrition data sources on the web that are open-source. I’m developing a recipe management system called Root Cellar (Github), and I’m interested in calculating out an estimate of the nutritional content of each ingredient in a recipe. To do so I needed a data source that I could access easily (ideally without calling an API) from a Sveltekit application.

Data Loading Blues

The ComprehensiveFoodDatabase seemed a promising data source. I tried downloading, but was quickly met with errors as Github notified me the Git Large File Storage quota had been exceeded and I would not be allowed to clone the datbase from the repo. I finally found a workaround by forking then archiving a copy of the repo. If you want access to the original dataset you should be able to download it by downloading as a zip from: https://github.com/jpoles1/ComprehensiveFoodDatabase

Next, I wanted to access the data using sqlite3. Unfortunately, I got a bunch of errors when trying to load in the .sql file. I eventually learned this was because the file was compatible with MySQL, but included syntax unsuitable for SQLite. To fix this, I tried a number of scripts and other conversion solutions found on stackoverflow and Github, but all either froze (given the large file size of 7+ GB) or failed.

Eventually I just spun up a MySQL database using Docker and after a couple hours of (very slow) SQL importing, the database was ready. The provided database has the data in both row and column formats. While row format may be useful for some circumstances, it results in a significant duplication of data (columns like description and brand name) inflating file sizes significantly.

MySQL Success => Export

Once imported into MySQL I was able to dump all the tables to csv (to make it easier for others to use for other assorted projects (file sizes aren’t too shabby either). You can download both the sqlite and csv files here. If you want to run SQL queries without any other overhead, I recommend using DB Browser for SQLite to load and explore the data.

I have included 2 sqlite files:

– CompFood.sqlite = 3 tables (menustat, usda_branded_column, usda_non_branded_column)
– CompFoodSearch.sqlite = Above with an added full-text search table (food_search) using FTS5.

Full Text Search

Next, I needed full-text search in order to locate each ingredient from my recipe in the database. SQLite may seem simple at first, but under the hood are a ton of powerful features. Using the FTS5 extension, I am able to easily enable fulltext search across my table. To get started, I generated the food_search table using the following SQL code:

CREATE VIRTUAL TABLE food_search USING fts5(
    fdc_id,
    description, brand_name, brand_owner,
    serving_size, serving_size_unit,
    protein_unit, carb_unit, fat_unit, energy_unit, fiber_unit, sodiumna_unit, sugarstotal_unit,
    protein_amount, carb_amount, fat_amount, energy_amount, fiber_amount, sodiumna_amount, sugarstotal_amount
);

INSERT INTO food_search (
    fdc_id, 
    description, brand_name, brand_owner,
    serving_size, serving_size_unit,
    protein_unit, carb_unit, fat_unit, energy_unit, fiber_unit, sodiumna_unit, sugarstotal_unit,
    protein_amount, carb_amount, fat_amount, energy_amount, fiber_amount, sodiumna_amount, sugarstotal_amount
) SELECT
    fdc_id, 
    description, brand_name, brand_owner,
    serving_size, serving_size_unit,
    protein_unit, carb_unit, fat_unit, energy_unit, fiber_unit, sodiumna_unit, sugarstotalincludingnlea_unit as sugarstotal_unit,
    protein_amount, carb_amount, fat_amount, energy_amount, fiber_amount, sodiumna_amount, sugarstotalincludingnlea_amount as sugarstotal_amount
FROM usda_branded_column;

INSERT INTO food_search (
    fdc_id, 
    description,
    serving_size, serving_size_unit,
    protein_unit, carb_unit, fat_unit, energy_unit, fiber_unit,
    protein_amount, carb_amount, fat_amount, energy_amount, fiber_amount 
) SELECT
    fdc_id, 
    description,
    serving_amount as serving_size, serving_text as serving_size_unit,
    protein_unit, carb_unit, fat_unit, energy_unit, fiber_unit,
    protein_amount, carb_amount, fat_amount, energy_amount, fiber_amount
FROM usda_non_branded_column;

Once created, I can now search the food_search table using SQL like the following:

SELECT * FROM food_search WHERE food_search MATCH '${ingredient}' ORDER BY rank LIMIT 50

For example, I can search for foods containing the word blue cheese, sorting first on rank to get the top matches, then sorting on length to get the shortest food description (which I think is going to select for the most likely matches for my use case).

SELECT DISTINCT * FROM (
	SELECT DISTINCT * FROM food_search WHERE food_search MATCH 'blue cheese' ORDER BY rank LIMIT 50
) ORDER BY length(description) LIMIT 10

I hope this quick tutorial can help you get started with exploring this data set. A huge thank you to the ComprehensiveFoodDatabase team for their work in constructing this dataset. You can read more about the creation of this dataset here: https://arxiv.org/abs/2301.10649

Citation:
Whalen, Lexington, Brie Turner-McGrievy, Matthew McGrievy, Andrew Hester, and Homayoun Valafar. “On Creating a Comprehensive Food Database.” In 2022 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 1610-1614. IEEE, 2022.