
Bandits with global convex constraints and objective. (English) Zbl 1444.90058

Summary: We consider a very general model for managing the exploration-exploitation trade-off, which allows global convex constraints and a concave objective on the aggregate decisions over time, in addition to the customary limitation on the time horizon. This model provides a natural framework for studying many sequential decision-making problems with long-term convex constraints and concave utility, and it subsumes the classic multiarmed bandit (MAB) model and the bandits-with-knapsacks problem as special cases. We demonstrate that a natural extension of the upper confidence bound (UCB) family of algorithms for MAB yields a polynomial-time algorithm with near-optimal regret guarantees for this substantially more general model. We also provide computationally more efficient algorithms by establishing connections between this problem and other well-studied problems and algorithms, such as the Blackwell approachability problem, online convex optimization, and the Frank-Wolfe technique for convex optimization. We give several concrete examples of applications, particularly in risk-sensitive revenue management under unknown demand distributions, in which this more general bandit model of sequential decision making allows for richer formulations and more efficient solutions of the problem.
The online appendices are available at https://doi.org/10.1287/opre.2019.1840.
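To illustrate the MAB special case that the paper's model generalizes, the following is a minimal sketch of the classic UCB1 index rule ("optimism in the face of uncertainty"). All names here are illustrative; this is not the paper's algorithm, which extends the UCB family to also track a concave objective and global convex constraints on the aggregate decisions.

```python
import math

def ucb1(pull, n_arms, horizon):
    """Minimal UCB1 sketch for the classic multiarmed bandit.

    `pull(arm)` returns a stochastic reward in [0, 1]. Each arm is
    played once to initialize; thereafter the arm maximizing
    empirical mean + sqrt(2 ln t / n_a) is chosen.
    """
    counts = [0] * n_arms      # pulls per arm
    sums = [0.0] * n_arms      # cumulative reward per arm
    total = 0.0
    for t in range(horizon):
        if t < n_arms:
            arm = t  # initialization round
        else:
            # empirical mean plus confidence radius (optimistic index)
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t + 1) / counts[a]),
            )
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts
```

On a two-armed Bernoulli instance with a large gap between the arm means, this rule concentrates its pulls on the better arm, which is the behavior the paper's extended UCB algorithms preserve while additionally respecting global constraints.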

MSC:

90B50 Management decision making, including multiple objectives
90C25 Convex programming

