**System of Interest and Motivation for Modeling**

For those unfamiliar, Uber is a SaaS company that develops and maintains a mobile application of the same name. The app enables users to request service from crowd-sourced taxi drivers in real time. The drivers are independent contractors who set their own hours, operate their own vehicles, and route all transactions through the Uber app (1). Uber markets itself as a service that gives everyone a “private driver.”

Our group found Uber particularly interesting because the system is highly stochastic. Both the number of customer requests *and* the number of drivers on the road are random variables. When demand spikes, Uber raises a “surge pricing” price multiplier (PM) with the intent of stabilizing the system. An excerpt from Uber’s website is below (2):

> Uber rates increase to ensure reliability when demand cannot be met by the number of drivers on the road.
>
> Our goal is to be as reliable as possible in connecting you with a driver whenever you need one. At times of high demand, the number of drivers we can connect you with becomes limited. As a result, prices increase to encourage more drivers to become available.

From this statement, it seems that the PM is adjusted with the intent of *increasing* supply, rather than both increasing supply and decreasing demand. Whether drivers are actually motivated to come “online” by the PM is a subject of debate. Professor Nick Diakopoulos (University of Maryland, College Park) has analyzed a sample of Uber data in an attempt to answer this question. (His results suggest that surge pricing serves primarily to redistribute drivers already on the road.)

To bring the topic in line with the project requirements for IEOR 131, my group chose to create a *simplified* model of a system in which demand and supply are stochastic, with a centralized price multiplier as a means of stabilizing the system.

**The Model**

The Price Multiplier (PM) is central to our model. PM is given by max{1, B * Q / (D + 1)}, where B, Q, and D are defined below. Customers arrive less frequently as PM increases, and drivers come into service *more* frequently as PM increases.

Key Variables:

- Q : The number of customers with unassigned requests for service.
- D : The number of drivers on the road.
- DMU : Mean time for a driver to arrive with PM = 1.
- LMU : Mean length of a driver’s shift in minutes.
- C : Mean time for a customer to arrive with PM = 1.
- B : Our decision variable and means of enforcing system stability. A high B (≫ D) causes demand to drop and supply to increase sharply whenever Q is nonzero. A small B (about 1) results in more stable prices, but may also result in poor service.

Arrival processes:

- Inter-arrival times for a customer request for service are exponential with mean C * PM.
- Inter-arrival times for drivers to become active are exponential with mean DMU / PM.

Service process:

- Time from [service accepted] to [customer picked up] is max{1, Norm{5, 1}}, where “Norm{m, d}” denotes a normally distributed random variable with mean “m” and standard deviation “d.”
- Time from [customer picked up] to [customer dropped off] is Uniform{5, 15}.

Departure process:

- Time from [driver becomes active] to [driver shift ends] is max{30, Norm{LMU, 0.2 * LMU}}.
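The random primitives above can be sketched in a few lines of Python. (Our actual model was built in Sigma; this is only an illustrative re-implementation, and the function names are our own.)

```python
import random

def price_multiplier(B, Q, D):
    # PM = max{1, B * Q / (D + 1)}: prices surge when waiting
    # requests (Q) outnumber what the driver pool (D) can absorb.
    return max(1.0, B * Q / (D + 1))

def customer_interarrival(C, pm):
    # Exponential with mean C * PM: higher prices slow demand.
    return random.expovariate(1.0 / (C * pm))

def driver_interarrival(DMU, pm):
    # Exponential with mean DMU / PM: higher prices speed up supply.
    return random.expovariate(pm / DMU)

def pickup_time():
    # max{1, Norm{5, 1}} minutes from acceptance to pickup.
    return max(1.0, random.gauss(5, 1))

def trip_time():
    # Uniform{5, 15} minutes from pickup to drop-off.
    return random.uniform(5, 15)

def shift_length(LMU):
    # max{30, Norm{LMU, 0.2 * LMU}} minutes of active driving.
    return max(30.0, random.gauss(LMU, 0.2 * LMU))
```

For instance, with B = 10, five unassigned requests, and four drivers on the road, `price_multiplier(10, 5, 4)` gives PM = 10; with Q = 0 the max{} clamp keeps PM at 1.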

**Analysis and Results**

The image above shows the first set of inputs fed into our “Uber” model. A Java program computed the Cartesian product of these inputs and wrote the result to a text file for batch processing by the simulation engine (Sigma). Sigma output was imported into MATLAB, where we performed the following analysis.

Initially, we measured system stability by variation in PM: we thought that a stable system was characterized by stable prices. To quantify variation in PM, we chose the coefficient of variation. The figure below shows a significant tendency toward B = 1. While developing the model, however, it seemed that higher values of B kept Q closer to zero and the number of drivers on the road more stable. When we considered minimizing Q as an alternative metric, we saw a complete shift in the preferred value of B (to near B = 19).

**Citations**

(1) https://get.uber.com/drive/

(2) https://help.uber.com/h/6c8065cf-5535-4a8b-9940-d292ffdce119

I came to Berkeley in Fall 2012 as an undergraduate in Industrial Engineering and Operations Research (IEOR). Starting my fifth semester I began to take courses in core CS, upper division mathematics, and graduate-level operations research. I spent this past summer developing constant-factor approximation algorithms for scheduling MapReduce jobs (at the University of Maryland, College Park), and this December I applied to PhD programs across the country (majority CS with some OR). I’ve been offered admission to Cornell’s Operations Research PhD program, and I’ve been invited to the visit day for Caltech’s Computing + Mathematical Sciences PhD program.

Given that I’ve yet to start graduate school, I haven’t determined where my research will end up leading. Here’s a non-exhaustive list of interests:

- Algorithms for graphs.
- Computational biology.
- Algorithms for unconstrained optimization.
- Algorithms for constrained optimization (particularly interior point methods).
- Constant-factor approximation algorithms.
- Theoretical aspects of distributed and parallel computation.
- Control (as in “control theory”) of high performance computing systems.

My main reason for taking CS 267 is to make progress on the last two points on that list. I think this course is a good starting point precisely *because* it is application focused; understanding what high performance computing actually looks like is critical in selecting research projects in the HPC domain which have both theoretical significance and practical interest.

The rest of this post will provide a high-level overview of an exciting paper that appeared at a recent supercomputing conference: *Parallel Distributed Memory Construction of Suffix and Longest Common Prefix Arrays* (SC ’15).

Many problems in computational biology involve searching a massive reference text for exact or inexact matches of a large number of shorter query strings. One of the most common reference texts is the human genome – represented as a sequence of approximately 10^9 characters. The set of query strings varies greatly with the application in mind, but will often contain more than 10^6 “short reads” of approximately 100 characters each.

Under these circumstances, we cannot afford to search the entire reference text for each query. Instead, we preprocess the reference text to support querying that is linear not in the size of the reference text, but in the size of *the query string*. That is, given a single short read of 100 characters and a reference text of 10^9 characters, we can correctly find a match (or determine that one does not exist) in approximately *100 steps*, rather than *1,000,000,000* steps.

The most fundamental data structure for this type of work is the Suffix Tree. The suffix tree of a string X is characterized by the fact that every suffix of X is spelled out by a unique path from the root to a leaf corresponding to that suffix. There are other requirements for suffix trees, but “every suffix gets a unique root-to-leaf path” is the most conceptually important. To get an idea of what the suffix tree of a given word looks like, check out this app on VisuAlgo.*

Suffix trees are great, but link-based data structures take up more memory and can be slightly slower than appropriately designed array-based data structures. In view of this, Manber and Myers (1990) proposed an array-based alternative to suffix trees that they called *Suffix Arrays*.

The suffix array “SA” of a string X satisfies the property that SA[i] is the *location in X* of the *i*-th smallest suffix of X (where a suffix X[a:end] is smaller than X[b:end] if X[a:end] comes before X[b:end] in an alphabetically sorted list). So SA[0] identifies the suffix of X that (in the sorted-list sense!) is smaller than every other suffix of X. Similarly, SA[len(X)-1] identifies the suffix of X that is larger than every other suffix of X. More generally, if you wanted the i-th smallest suffix of X in hand (i.e. not just an index specifying a location), you would write X[SA[i]:end]. You can also play with Suffix Arrays on VisuAlgo.net.
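A naive construction makes the definition concrete. This Python sketch (our own illustration, not the paper’s algorithm) simply sorts the suffix start positions by the suffixes they name; that is far too slow for genome-scale inputs, but fine for building intuition:

```python
def suffix_array(text):
    # Sort the start positions 0..len(text)-1 by the suffixes
    # text[i:] they identify; O(n^2 log n) in the worst case.
    return sorted(range(len(text)), key=lambda i: text[i:])

# SA[0] locates the smallest suffix ("$"), SA[-1] the largest.
print(suffix_array("banana$"))  # [6, 5, 3, 1, 0, 4, 2]
```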

On its own, a suffix array does not replace a suffix tree (that is, there exist distinct strings over the same alphabet with identical suffix arrays, while the same is not true for suffix trees). To remedy this, Manber and Myers proposed an *additional* data structure that can be combined with suffix arrays to capture all the information contained in suffix trees. The supplementary data structure is called the Longest Common Prefix (LCP) Array, and is defined as follows (for a reference text “X”):

LCP[i] = length_of_longest_common_prefix(X[SA[i-1]:end], X[SA[i]:end]).

That is, LCP[i] is the number of uninterrupted matches among leading characters of the (i-1)-st smallest and i-th smallest suffixes. (Again, “smallest” here means “earliest in the order of an alphabetically sorted list”.) LCP[0] does not have an interpretation, and is usually initialized to 0.
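Given the suffix array, that definition translates directly into code. Here is a character-by-character sketch (again our own illustration; production implementations, e.g. Kasai’s algorithm, compute LCP in O(n)):

```python
def lcp_array(text, sa):
    # LCP[i] counts the matching leading characters of the
    # (i-1)-st and i-th smallest suffixes; LCP[0] is set to 0.
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = text[sa[i - 1]:], text[sa[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return lcp

print(lcp_array("banana$", [6, 5, 3, 1, 0, 4, 2]))  # [0, 0, 1, 3, 0, 0, 2]
```

For example, LCP[3] = 3 because the suffixes “ana$” and “anana$” share the three-character prefix “ana”.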

That’s a lot of background, but we’re now ready to dive into the paper itself! As is clear from the title, the authors developed a parallel algorithm for constructing suffix arrays and LCP arrays.

* You can use any set of letters you want in the VisuAlgo app, but your string must end with a unique delimiting character (they use the dollar sign, $). The delimiting character is necessary because some strings *don’t have suffix trees*. One simple example is the word “banana”. Banana violates one of the suffix tree requirements (I won’t say which) since the suffix “na” is a prefix of the suffix “nana”.

Pretty much all algorithms for suffix trees, suffix arrays, and LCP arrays get around this by appending a delimiting character (usually “$”) to the reference text. The delimiting character must be lexicographically smaller than every other character that can appear in the string. The dollar sign is used as the delimiting character precisely because [$ < anyLetterOrNumber] evaluates to True under the default ASCII-based character ordering.
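The ordering claim is easy to check in, say, Python, where string comparison follows code-point order:

```python
# '$' is ASCII 36; digits start at 48 and letters at 65, so any
# suffix beginning with '$' sorts before every other suffix.
assert "$" < "0" < "A" < "a"
assert sorted(["nana$", "na$", "$"]) == ["$", "na$", "nana$"]
```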

*We abbreviate “parallel distributed memory” as PDM.

The original Suffix & LCP Array paper (by Manber and Myers, 1990) showed how to construct these arrays in O(n log(n)) time on a sequential processor. Subsequent work by Ko and Aluru (2003) developed an O(n) scheme for sequential processors. Independent work by Kärkkäinen and Sanders (also 2003) developed a very different algorithm that attained the same O(n) runtime.

PDM suffix array construction was first studied by Futamura, Aluru, and Kurtz in 2001. Their approach suffered from two drawbacks: (1) the entire reference text needed to be kept in memory at each processor, and (2) the algorithm’s runtime was O(n^2) in the worst case. In 2014, Abdelhadi et al. extended this algorithm to the MPI model.

The problem has also been examined for shared-memory machines (SMMs). Two research groups have examined SMM suffix array construction with traditional CPUs, but both saw speedup factors < 2. In 2014, Shun published a paper (*Fast Parallel Computation of Longest Common Prefixes*) detailing a number of methods to construct the LCP array *given the suffix array*. Shun’s algorithms demonstrated speedup factors between 20 and 50. SMM suffix array construction has also been studied on GPUs: Deo and Keely saw speedups up to 35x, and Osipov saw speedups between 6x and 18x.

In 2007, Kulla and Sanders developed a PDM suffix array construction algorithm that they described as “completely independent of the model of computation.” The implementation they analyzed in detail was based on the MPI model. The interested reader is encouraged to review the original paper, but suffice it to say that (1) the runtime dependence on *n* was no worse than O(n log(n)), and (2) the algorithm included an *additive complexity bottleneck* on the order of p^2 * log(p) * log(n), where p is the number of processors used in parallel.

The primary paper considered in this post (i.e. Flick and Aluru, 2015) is similar to the Kulla-Sanders paper, but differs in the following ways: (1) Flick and Aluru construct the LCP array concurrently with the suffix array, where Kulla and Sanders do not construct the LCP array, and (2) Flick and Aluru remove the *additive* bottleneck term p^2 * log(p) * log(n), at the expense of introducing a *multiplicative* factor of log(n). The second of these distinctions enables the algorithm of Flick and Aluru to scale efficiently to thousands of processors.

The algorithm was designed for high-core-count, distributed-memory machines. Their experimental environment consisted of 100 identical nodes, each with the following specifications:

- 2 Intel E5 2650 processors (2 GHz, 8 cores/CPU)
- 128 GB memory
- Node-to-node connections over 40 Gbit InfiniBand.

So in total, the cluster contained 1600 cores.

Asymptotically, the only runtime dependence on processors comes from the time to sort *n* items in parallel on *p* processors.

The speedup factors substantially exceed those in all known prior work. The tables and figures below summarize the performance of the algorithm in a variety of ways.

Table 1 shows how alternative suffix + LCP array construction methods compare to the authors’ proposed method. *divsufsort* is the fastest-known open-source implementation of a sequential suffix + LCP array algorithm. *mkESA* is one of the shared-memory algorithms mentioned earlier (with speedup < 2). *cloudSACA* is a recent implementation of the 2001 O(n^2) parallel algorithm discussed earlier.

Table 2 shows how runtime of the algorithm scales with processors in *absolute time*. The most striking fact is the ability to create the suffix array of the entire human genome (3 billion base-pairs) in under 5 seconds (though this is without the LCP array).

The plot below demonstrates just how well the proposed algorithm scales. As stated before, *divsufsort* is the fastest-known open-source implementation of a sequential suffix + LCP array algorithm (it is represented by the dotted horizontal line near the bottom of the graph). The plot demonstrates that when using 64 of the 100 available nodes, the proposed algorithm outperforms *divsufsort* by a factor greater than 100 (the paper states that the precise factor was just over 110). When using 64 nodes (1024 cores), the local inputs were only 3MB in size.

The authors suggest exploring how empirical performance of their algorithm could be enhanced with hybrid parallelism. Local operations (particularly the sorting that occurs at each node) may be possible to speed up with shared-memory parallel implementations (via OpenMP) or GPU’s (via CUDA or OpenCL).

In an effort to make myself a better informed Industrial Engineer, and to provide some useful information to the department, I organized a group of Alpha Pi Mu members to do some digging and learn how other IE* departments structure their programs.

Analogs to required UCB IEOR courses: This document organizes courses by category and covers courses that are required by UC Berkeley’s Department of IEOR. Some elective requirements are set in such a way that a topic (e.g. production systems) cannot be avoided, but unless the course is *specifically* required, it is not included in this document.

Courses Required by Programs Other Than UC Berkeley’s Department of IEOR: This document organizes courses by category, and covers courses *not* directly required by UC Berkeley’s Department of IEOR.

IE Curriculum Research for sharing: This document covers all schools surveyed, and presents each school’s coursework in chronological order (when an order was available).

Official Program Requirements: These are *some* of the source documents used to construct the three documents above.

*In fact, IE, IEOR, ORIE, IME, ISyE, ISE, OR, MS&E, and a few other acronyms were covered.
