Extracting Nutrition Data

There’s a relative paucity of good nutrition data sources on the web that are open-source. I’m developing a recipe management system called Root Cellar (Github), and I’m interested in calculating out an estimate of the nutritional content of each ingredient in a recipe. To do so I needed a data source that I could access easily (ideally without calling an API) from a Sveltekit application.

Data Loading Blues

The ComprehensiveFoodDatabase seemed a promising data source. I tried downloading, but was quickly met with errors as Github notified me the Git Large File Storage quota had been exceeded and I would not be allowed to clone the datbase from the repo. I finally found a workaround by forking then archiving a copy of the repo. If you want access to the original dataset you should be able to download it by downloading as a zip from: https://github.com/jpoles1/ComprehensiveFoodDatabase

Next, I wanted to access the data using sqlite3. Unfortunately, I got a bunch of errors when trying to load in the .sql file. I eventually learned this was because the file was compatible with MySQL, but included syntax unsuitable for SQLite. To fix this, I tried a number of scripts and other conversion solutions found on stackoverflow and Github, but all either froze (given the large file size of 7+ GB) or failed.

Eventually I just spun up a MySQL database using Docker and after a couple hours of (very slow) SQL importing, the database was ready. The provided database has the data in both row and column formats. While row format may be useful for some circumstances, it results in a significant duplication of data (columns like description and brand name) inflating file sizes significantly.

MySQL Success => Export

Once imported into MySQL I was able to dump all the tables to csv (to make it easier for others to use for other assorted projects (file sizes aren’t too shabby either). You can download both the sqlite and csv files here. If you want to run SQL queries without any other overhead, I recommend using DB Browser for SQLite to load and explore the data.

I have included 2 sqlite files:

– CompFood.sqlite = 3 tables (menustat, usda_branded_column, usda_non_branded_column)
– CompFoodSearch.sqlite = Above with an added full-text search table (food_search) using FTS5.

Full Text Search

Next, I needed full-text search in order to locate each ingredient from my recipe in the database. SQLite may seem simple at first, but under the hood are a ton of powerful features. Using the FTS5 extension, I am able to easily enable fulltext search across my table. To get started, I generated the food_search table using the following SQL code:

CREATE VIRTUAL TABLE food_search USING fts5(
    fdc_id,
    description, brand_name, brand_owner,
    serving_size, serving_size_unit,
    protein_unit, carb_unit, fat_unit, energy_unit, fiber_unit, sodiumna_unit, sugarstotal_unit,
    protein_amount, carb_amount, fat_amount, energy_amount, fiber_amount, sodiumna_amount, sugarstotal_amount
);

INSERT INTO food_search (
    fdc_id, 
    description, brand_name, brand_owner,
    serving_size, serving_size_unit,
    protein_unit, carb_unit, fat_unit, energy_unit, fiber_unit, sodiumna_unit, sugarstotal_unit,
    protein_amount, carb_amount, fat_amount, energy_amount, fiber_amount, sodiumna_amount, sugarstotal_amount
) SELECT
    fdc_id, 
    description, brand_name, brand_owner,
    serving_size, serving_size_unit,
    protein_unit, carb_unit, fat_unit, energy_unit, fiber_unit, sodiumna_unit, sugarstotalincludingnlea_unit as sugarstotal_unit,
    protein_amount, carb_amount, fat_amount, energy_amount, fiber_amount, sodiumna_amount, sugarstotalincludingnlea_amount as sugarstotal_amount
FROM usda_branded_column;

INSERT INTO food_search (
    fdc_id, 
    description,
    serving_size, serving_size_unit,
    protein_unit, carb_unit, fat_unit, energy_unit, fiber_unit,
    protein_amount, carb_amount, fat_amount, energy_amount, fiber_amount 
) SELECT
    fdc_id, 
    description,
    serving_amount as serving_size, serving_text as serving_size_unit,
    protein_unit, carb_unit, fat_unit, energy_unit, fiber_unit,
    protein_amount, carb_amount, fat_amount, energy_amount, fiber_amount
FROM usda_non_branded_column;

Once created, I can now search the food_search table using SQL like the following:

SELECT * FROM food_search WHERE food_search MATCH '${ingredient}' ORDER BY rank LIMIT 50

For example, I can search for foods containing the word blue cheese, sorting first on rank to get the top matches, then sorting on length to get the shortest food description (which I think is going to select for the most likely matches for my use case).

SELECT DISTINCT * FROM (
	SELECT DISTINCT * FROM food_search WHERE food_search MATCH 'blue cheese' ORDER BY rank LIMIT 50
) ORDER BY length(description) LIMIT 10

I hope this quick tutorial can help you get started with exploring this data set. A huge thank you to the ComprehensiveFoodDatabase team for their work in constructing this dataset. You can read more about the creation of this dataset here: https://arxiv.org/abs/2301.10649

Citation:
Whalen, Lexington, Brie Turner-McGrievy, Matthew McGrievy, Andrew Hester, and Homayoun Valafar. “On Creating a Comprehensive Food Database.” In 2022 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 1610-1614. IEEE, 2022.

Leave a Reply

Your email address will not be published.