By Aamir Manasawala and Mitchell Koch, Software Engineers
Customers across North America come to DoorDash to discover and order from a vast selection of their favorite stores. Our mission is to surface the best stores for our consumers based on their personal preferences. However, the notion of “best stores” for a consumer varies widely based on their diet, taste, budget, and other preferences.
To achieve this mission, we are building search products that provide a personalized discovery and search experience based on a consumer’s past search and order history with DoorDash. This article details our approach for personalization in our search ecosystem, which has provided a significant lift in conversion from search to checkout.
The three-sided nature of DoorDash platform (involving consumers, dashers and merchants) presents a lot of interesting and unique search challenges in addition to the general search and recommendation related problems. Some challenges include:
We use Elasticsearch to power the consumer search for our website and apps. Elasticsearch is an open source, distributed, Lucene-based inverted index that provides search engine capabilities without reinventing the wheel.
For our search engine there are two primary components:
The first is the indexing module (offline). This component reads the store object from the database (Postgres in our case) and writes it to Elasticsearch for bootstrapping, as well as for partial asynchronous updates on the database store object.
Second is the search module (online). Web and mobile clients call the backend search API with the specified consumer location. A JSON-based Elasticsearch query is constructed at the Django backend to call Elasticsearch. The query is executed inside Elasticsearch to retrieve relevant results, which are deserialized and returned to the client. The Elasticsearch query is primarily designed to achieve two purposes:
Now let’s talk about the ML model training and testing for personalization. For including personalization in Elasticsearch, we define a knowledge-based recommender system over consumer / store pairs. For every consumer we are evaluating how good the recommendation is for each specific store based on the consumer’s order history and page views.
To help us out, let’s define some basic terms (note that Medium doesn’t handle equations well, so apologies in advance for the janky formatting):
The data profile of consumer c_i mainly refers to all the data that we need as a signal for the recommendation model. We store and update d(c_i) for each c_i in the database for it to be consumed in the online pipeline.
The data profile of store s_j is stored in Elasticsearch by the indexing pipeline.
f^k is a feature in the machine learning model and f^k_ij is the specific value for the (c_i, s_j) pair. For example, one feature we include is how much overlap there is between the cuisines the consumer c_i had ordered from in the past and the cuisine of the store s_j. We would include similar features based on viewing store pages, price range, etc. For training, we generate f^k_ij for each i, j such that c_i, s_j are visible to each other from the selection criteria described earlier along with a 0/1 flag, which generates the data in the following format:
[0/1 flag, f^0_ij , f^1_ij , f^2_ij , … f^k_ij …] for each i, j such that s_j falls in selectable range of c_i.
Positive examples (marked as 1 in the data model) are the ones where the consumer c_i ordered from that store and the negatives are the ones where, despite the store being exposed to the consumer, the consumer did not order.
We use this data to compute the probability of consumer c_i ordering from s_j given by:
Probability(c_i orders from store s_j) = 1/(1+e^(-1* ( w_k * f^k_ij)) ) where w_k is the weight of kth feature.
We trained the data using the logistic regression model to estimate w_k for our dataset.
Now let’s discuss how we integrate the personalization piece into the Elasticsearch ecosystem, which serves our app and website in real time. To achieve scoring we have to implement the above mentioned logistic regression scoring function inside Elasticsearch. We accomplished that through the script scoring feature of Elasticsearch, which is used for customized ranking use cases such as ours. This script has access to documents inside Elasticsearch and parameters that can be passed as run time arguments in the Elasticsearch query. The score generated by the script is then used for ranking a script based sorting feature to get the desired ranking.
The following diagram describes the overall architecture depicting offline and online components.
We’ve only scratched the surface with the work we’ve done. Here are some areas that we are working on to make our search engine even better:
If you are passionate about solving challenging problems in this space, we are hiring for the search & relevance team and data infrastructure team. If you are interested in working on other areas at DoorDash check out our careers page. Let us know for any comments or suggestions.