Simulating Stability of an Uber-Esque Transit System

IEOR 131 (Discrete Event Simulation) is a core course in the department of IEOR at UC Berkeley. The course requirements include a term project to model and analyze a system of one’s choice. I worked with four fellow IEOR students to model and analyze a simple system meant to resemble the dynamics of a decentralized transportation system like Uber.

System of Interest and Motivation for Modeling

For those unfamiliar, Uber is a SaaS company that develops and maintains a mobile application by the same name. The app enables users to request service from crowd-sourced taxi drivers in real time. The taxi drivers are independent contractors who set their own hours, operate their own vehicles, and route all transactions through the Uber app (1). Uber markets itself as a service that enables everyone to have a “private driver.”

Our group found Uber particularly interesting because the system is highly stochastic. Both the number of customer requests and the number of drivers on the road are random variables. When demand spikes, Uber raises a “surge pricing” price multiplier (PM) with the intent of stabilizing the system.  An excerpt from Uber’s website is below (2):

Uber rates increase to ensure reliability when demand cannot be met by the number of drivers on the road.

Our goal is to be as reliable as possible in connecting you with a driver whenever you need one. At times of high demand, the number of drivers we can connect you with becomes limited. As a result, prices increase to encourage more drivers to become available.

From this statement, it seems that the PM is adjusted with the intent of increasing supply, rather than both increasing supply and decreasing demand. Whether or not drivers are actually motivated to come “online” based on the PM is a subject of debate. Professor Nick Diakopoulos (University of Maryland, College Park) has analyzed a sample of Uber data in an attempt to answer this question. (His results suggest that surge pricing serves primarily to redistribute drivers already on the road.)

To bring the topic in line with the project requirements for IEOR 131, my group chose to build a simplified model of a system in which both demand and supply are stochastic and a centralized price multiplier serves as the means of stabilizing the system.

The Model

The Price Multiplier (PM) is central to our model. PM is given by max{1, B * Q / (D + 1)}, where B, Q, and D are defined below. Customers arrive less frequently as PM increases, and drivers come into service more frequently as PM increases.

Key Variables:

  • Q : The number of customers with unassigned requests for service.
  • D : The number of drivers on the road.
  • DMU : Mean inter-arrival time for drivers when PM = 1.
  • LMU : Mean length of a driver’s shift in minutes.
  • C : Mean inter-arrival time for customer requests when PM = 1.
  • B : Our decision variable and means of enforcing system stability. A high B (>> D) causes demand to drop and supply to increase significantly whenever Q is nonzero. A small B (close to 1) results in more stable prices, but may also result in poor service.
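For concreteness, the price multiplier can be written as a one-line function of these quantities. The sketch below (in Python, which we did not actually use for the project) is only an illustration of the formula above, with a made-up numerical example.

    def price_multiplier(B, Q, D):
        # Surge price multiplier: max{1, B * Q / (D + 1)}.
        # Q = unassigned customer requests, D = drivers on the road,
        # B = the tuning parameter described above.
        return max(1.0, B * Q / (D + 1))

    # Hypothetical example: 10 waiting customers, 4 drivers, B = 2.
    print(price_multiplier(B=2, Q=10, D=4))  # prints 4.0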

Arrival processes:

  • Inter-arrival times for a customer request for service are exponential with mean C * PM.
  • Inter-arrival times for drivers to become active are exponential with mean DMU / PM.

Service process:

  • Time from [service accepted] to [customer picked up] is max{1, Norm{5, 1}}, where “Norm{m, d}” denotes a normally distributed random variable with mean “m” and standard deviation “d.”
  • Time from [customer picked up] to [customer dropped off] is Uniform{5, 15}.

Departure process:

  • Time from [driver becomes active] to [driver shift ends] is max{30, Norm{LMU, 0.2 * LMU}}.
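To make the stochastic pieces concrete, here is a minimal sketch of how the random quantities above could be sampled. It uses Python with numpy purely for illustration (our actual workflow used Sigma for the simulation itself); the clipping with max{...} matches the definitions above.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    def customer_interarrival(C, PM):
        # Exponential with mean C * PM: demand thins out as PM rises.
        return rng.exponential(C * PM)

    def driver_interarrival(DMU, PM):
        # Exponential with mean DMU / PM: supply arrives faster as PM rises.
        return rng.exponential(DMU / PM)

    def pickup_time():
        # max{1, Norm{5, 1}} minutes from acceptance to pickup.
        return max(1.0, rng.normal(5.0, 1.0))

    def ride_time():
        # Uniform{5, 15} minutes from pickup to drop-off.
        return rng.uniform(5.0, 15.0)

    def shift_length(LMU):
        # max{30, Norm{LMU, 0.2 * LMU}} minutes from activation to shift end.
        return max(30.0, rng.normal(LMU, 0.2 * LMU))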

Analysis and Results

One set of input parameters for our “Uber” simulation.

The image above shows the first set of inputs fed into our “Uber” model. A Java program computed the Cartesian product of these inputs and wrote the result to a text file for batch processing by the simulation engine (Sigma). Sigma’s output was imported into MATLAB, where we performed the following analysis.

Initially, we measured system stability by variation in PM: we thought that a stable system was characterized by stable prices. To quantify variation in PM, we chose the coefficient of variation. Under this metric, the figure below shows a strong tendency toward B = 1. While developing the model, however, it seemed that higher values of B kept Q closer to zero and kept the number of drivers on the road more stable. We therefore considered minimizing the average Q as an alternate metric and saw a complete shift in the preferred value of B (to near B = 19).
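For reference, both metrics are simple to compute from a simulation trace. The sketch below uses Python/numpy rather than the MATLAB we actually used, and the inputs are hypothetical arrays of sampled values; it shows the coefficient of variation of PM and a time-weighted average of Q.

    import numpy as np

    def coefficient_of_variation(pm_samples):
        # Standard deviation divided by mean of the observed PM values.
        pm = np.asarray(pm_samples, dtype=float)
        return pm.std() / pm.mean()

    def time_average(event_times, values):
        # Time-weighted average of a piecewise-constant trace such as Q(t);
        # values[i] holds between event_times[i] and event_times[i + 1].
        t = np.asarray(event_times, dtype=float)
        v = np.asarray(values, dtype=float)
        return np.sum(v[:-1] * np.diff(t)) / (t[-1] - t[0])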

Input-Output Relationships (including optimal “B”) for Average Q and Coefficient of Variation of PM

Citations

(1) https://get.uber.com/drive/

(2) https://help.uber.com/h/6c8065cf-5535-4a8b-9940-d292ffdce119

CS 267 – Homework 0

About Me

I came to Berkeley in Fall 2012 as an undergraduate in Industrial Engineering and Operations Research (IEOR). Starting in my fifth semester, I began taking courses in core CS, upper-division mathematics, and graduate-level operations research. I spent this past summer developing constant-factor approximation algorithms for scheduling MapReduce jobs (at the University of Maryland, College Park), and this December I applied to PhD programs across the country (mostly CS, with some OR). I’ve been offered admission to Cornell’s Operations Research PhD program, and I’ve been invited to the visit day for Caltech’s Computing + Mathematical Sciences PhD program.

Given that I’ve yet to start graduate school, I haven’t determined where my research will end up leading. Here’s a non-exhaustive list of interests:

  • Algorithms for graphs.
  • Computational biology.
  • Algorithms for unconstrained optimization.
  • Algorithms for constrained optimization (particularly interior point methods).
  • Constant-factor approximation algorithms.
  • Theoretical aspects of distributed and parallel computation.
  • Control (as in “control theory”) of high performance computing systems.

My main reason for taking CS 267 is to make progress on the last two points on that list. I think this course is a good starting point precisely because it is application-focused; understanding what high performance computing actually looks like is critical in selecting research projects in the HPC domain that have both theoretical significance and practical interest.

The rest of this post will provide a high-level overview of an exciting paper that appeared at a recent supercomputing conference: Parallel Distributed Memory Construction of Suffix and Longest Common Prefix Arrays (SC ’15).

An Application Area : Building Data Structures for Computational Biology

Why does Computational Biology need special data structures? What is the data structure considered in the paper?

Many problems in computational biology involve searching a massive reference text for exact or inexact matches of a large number of shorter query strings. One of the most common reference texts is the human genome – represented as a sequence of approximately 10^9 characters.  The set of query strings varies greatly with the application in mind, but will often contain more than 10^6 “short reads” of approximately 100 characters each.

Under these circumstances, we cannot afford to search the entire reference text for each query. Instead, we preprocess the reference text to support querying that is linear not in the size of the reference text, but in the size of the query string. That is, given a single short read of 100 characters and a reference text of 10^9 characters, we can correctly find a match (or determine that one does not exist) in approximately 100 steps, rather than 1,000,000,000 steps.

The most fundamental data structure for this type of work is the Suffix Tree. The suffix tree of a string X is characterized by the fact that every suffix of X is spelled out by a unique path from the root to a leaf corresponding to that suffix. There are other requirements for suffix trees, but “every suffix gets a unique root-to-leaf path” is the most conceptually important. To get an idea of what the suffix tree of a given word looks like, check out this app on VisuAlgo.*

Suffix trees are great, but link-based data structures take up more memory and can be slightly slower than appropriately designed array-based data structures. In view of this, Manber and Myers (1990) proposed an array-based alternative to suffix trees that they called Suffix Arrays.

The suffix array “SA” of a string X satisfies the property that SA[i] is the location in X of the i-th smallest suffix of X (where a suffix X[a:end] is smaller than X[b:end] if X[a:end] comes before X[b:end] in an alphabetically sorted list). So SA[0] identifies the suffix of X that (in the sorted-list sense!) is smaller than every other suffix in X. Similarly, SA[len(X)-1] identifies the suffix of X that is larger than every other suffix in X. More generally, if you wanted the i-th smallest suffix of X in hand (i.e. not just an index specifying a location), you would write X[SA[i]:end]. You can also play with Suffix Arrays on VisuAlgo.net.
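A deliberately naive construction makes the definition concrete: generate every suffix, sort them, and record their starting positions. The sketch below is for illustration only (its sorting step is far slower than the algorithms discussed later), and it uses the “banana$” example from the footnote below.

    def suffix_array(X):
        # SA[i] = start index in X of the i-th smallest suffix of X.
        # Assumes X ends with a unique, lexicographically smallest
        # delimiter such as '$'.
        return sorted(range(len(X)), key=lambda a: X[a:])

    X = "banana$"
    SA = suffix_array(X)
    print(SA)                   # [6, 5, 3, 1, 0, 4, 2]
    print([X[i:] for i in SA])  # ['$', 'a$', 'ana$', 'anana$', 'banana$', 'na$', 'nana$']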

On its own, a suffix array does not replace a suffix tree (that is, there exist distinct strings over the same alphabet with identical suffix arrays, while the same is not true for suffix trees). To remedy this, Manber and Myers proposed an additional data structure that can be combined with suffix arrays to capture all the information contained in suffix trees. The supplementary data structure is called the Longest Common Prefix (LCP) Array, and is defined as follows (for a reference text “X”):

LCP[i] = length_of_longest_common_prefix(X[SA[i-1]:end], X[SA[i]:end]).

That is, LCP[i] is the number of uninterrupted matches among leading characters of the (i-1)-st smallest and i-th smallest suffixes. (Again, “smallest” here means “earliest in the order of an alphabetically sorted list”.) LCP[0] does not have an interpretation, and is usually initialized to 0.
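Continuing the toy “banana$” example, the LCP array can be filled in directly from this definition. Again, this quadratic sketch is for illustration only.

    def lcp_array(X, SA):
        # LCP[i] = length of the longest common prefix of the (i-1)-st and
        # i-th smallest suffixes of X; LCP[0] is set to 0 by convention.
        LCP = [0] * len(SA)
        for i in range(1, len(SA)):
            a, b = X[SA[i - 1]:], X[SA[i]:]
            k = 0
            while k < min(len(a), len(b)) and a[k] == b[k]:
                k += 1
            LCP[i] = k
        return LCP

    print(lcp_array("banana$", [6, 5, 3, 1, 0, 4, 2]))  # [0, 0, 1, 3, 0, 0, 2]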

That’s a lot of background, but we’re now ready to dive into the paper itself! As is clear from the title, the authors developed a parallel algorithm for constructing suffix arrays and LCP arrays.

Introductory stuff you can gloss over unless you’re really interested or find yourself confused.

* You can use any set of letters you want in the VisuAlgo app, but your string must end with a unique delimiting character (they use the dollar sign, $).  The delimiting character is necessary because some strings don’t have suffix trees. One simple example is the word “banana”. Banana violates one of the suffix tree requirements (I won’t say which) since the suffix “na” is a prefix of the suffix “nana”.

Pretty much all algorithms for suffix trees, suffix arrays, and LCP arrays get around this by appending a delimiting character (usually “$”) to the reference text. The delimiting character must satisfy the property that it is the lexicographically smallest character among all possible characters in the string. The dollar sign is used as the delimiting character precisely because [$ < anyLetterOrNumber] will evaluate to True for the default ASCII-ID based character ordering.

Highlights from Parallel Distributed Memory Construction of Suffix and Longest Common Prefix Arrays (Flick and Aluru)

Where does this work fit in relation to existing literature?

* We abbreviate “parallel distributed memory” as PDM.

General Suffix Array and LCP Array Construction

The original Suffix & LCP Array paper (by Manber and Myers, 1990) showed how to construct these arrays in O(n log(n)) time on a sequential processor. Subsequent work by Ko and Aluru (2003) developed an O(n) scheme for sequential processors. Independent work by Kärkkäinen and Sanders (also 2003) developed a very different algorithm that attained the same O(n) runtime.

PDM suffix array construction was first studied by Futamura, Aluru, and Kurtz in 2001. Their approach suffered from two drawbacks: (1) the entire reference text needed to be kept in memory at each processor, and (2) the algorithm’s runtime was O(n^2) in the worst case. In 2014, Abdelhadi et al. extended this algorithm to the MPI model.

The problem has also been examined for shared-memory machines (SMMs). Two research groups have examined SMM suffix array construction with traditional CPUs, but both saw speedup factors < 2. In 2014, Shun published a paper (Fast Parallel Computation of Longest Common Prefixes) detailing a number of methods to construct the LCP array given the suffix array; Shun’s algorithms demonstrated speedup factors between 20 and 50. The suffix array construction problem on SMMs has also been studied on GPUs: Deo and Keely saw speedups of up to 35x, and Osipov saw speedups between 6x and 18x.

Side-by-Side Comparison to Closest Related Work

In 2007, Kulla and Sanders developed a PDM suffix array construction algorithm that they described as “completely independent of the model of computation.” The implementation they analyzed in detail was based on the MPI model. The interested reader is encouraged to review the original paper, but suffice it to say: (1) the runtime dependence on n was no worse than O(n log(n)), and (2) the algorithm included an additive complexity bottleneck on the order of p^2 * log(p) * log(n), where p is the number of processors used in parallel.

The primary paper considered in this post (i.e. Flick and Aluru, 2015) is similar to the Kulla-Sanders paper, but differs in the following ways: (1) Flick and Aluru construct the LCP array concurrently with the suffix array, whereas Kulla and Sanders do not construct the LCP array at all, and (2) Flick and Aluru remove the additive bottleneck term p^2 * log(p) * log(n), at the expense of introducing a multiplicative factor of log(n). The second of these distinctions enables the algorithm of Flick and Aluru to scale efficiently to thousands of processors.

What architecture is the algorithm designed for? How well does the algorithm scale?

The algorithm was designed for high-core-count, distributed-memory machines. Their experimental environment consisted of 100 identical nodes, each with the following specifications:

  • 2 Intel E5 2650 processors (2 GHz, 8 cores per CPU)
  • 128 GB memory
  • Node-to-node connections over 40 Gbit/s InfiniBand.

So in total, the cluster contained 1600 cores.

Asymptotically, the only runtime dependence on processors comes from the time to sort n items in parallel on p processors.

How well does the algorithm empirically perform?

The speedup factors substantially exceed those in all known prior work. The tables and figures below summarize the performance of the algorithm in a variety of ways.

Table 1 shows how alternative suffix + LCP array construction methods compare to the authors’ proposed method. divsufsort is the fastest known open-source implementation of a sequential suffix + LCP array algorithm. mkESA is one of the shared-memory algorithms mentioned earlier (with speedup < 2). cloudSACA is a recent implementation of the 2001 O(n^2) parallel algorithm discussed earlier.

performance comparisons for different genomes

Table 2 shows how runtime of the algorithm scales with processors in absolute time. The most striking fact is the ability to create the suffix array of the entire human genome (3 billion base-pairs) in under 5 seconds (though this is without the LCP array).

with and without LCP array

The plot below demonstrates just how well the proposed algorithm scales. As stated before, divsufsort is the fastest known open-source implementation of a sequential suffix + LCP array algorithm (it is represented by the dotted horizontal line near the bottom of the graph). The plot demonstrates that when using 64 of the 100 available nodes, the proposed algorithm outperforms divsufsort by a factor greater than 100 (the paper states that the precise factor was just over 110). When using 64 nodes (1024 cores), the local inputs were only 3 MB in size.

speedup graph

Do the authors address any open problems that remain for suffix array or LCP array construction?

The authors suggest exploring how empirical performance of their algorithm could be enhanced with hybrid parallelism. Local operations (particularly the sorting that occurs at each node) may be possible to speed up with shared-memory parallel implementations (via OpenMP) or GPU’s (via CUDA or OpenCL).

A Survey of Industrial Engineering, Operations Research, and Systems Engineering Curricula.

UC Berkeley’s Department of Industrial Engineering and Operations Research is undergoing a strategic planning process to set the department’s direction for the next seven years. In addition to my personal interest in Berkeley’s department of IEOR, I’m also interested in education and curricula in general.

In an effort to make myself a better informed Industrial Engineer, and to provide some useful information to the department, I organized a group of Alpha Pi Mu members to do some digging and learn how other IE* departments structure their programs.

Analogs to required UCB IEOR courses: This document organizes courses by category and covers courses that are required by UC Berkeley’s Department of IEOR. Some elective requirements are set in such a way that a topic (e.g. production systems) cannot be avoided, but unless a course is specifically required, it is not included in this document.

Courses Required by Programs Other Than UC Berkeley’s Department of IEOR: This document organizes courses by category, and covers courses not directly required by UC Berkeley’s Department of IEOR.

IE Curriculum Research for sharing: This document covers all schools surveyed and presents each school’s coursework in chronological order (when an order was available).

Official Program Requirements: These are some of the source documents used to construct the three documents above.

* In fact, IE, IEOR, ORIE, IME, ISyE, ISE, OR, MS&E, and a few other acronyms were covered.