1 Introduction
Due to the growing popularity of wireless deployments, especially the ones based on the IEEE 802.11 standard (i.e., WiFi), it is very common to find independent overlapping Wireless Networks (WNs) sharing the same channel resources. Due to the decentralized nature of such kind of deployments, the lack of organization and/or agreement on sharing policies leads to inefficient resources allocation, both in spectral and spatial domains. An illustrative example about the inefficiency noticed in WNs regarding channel sharing can be found in akella2007self , where the authors show that the power level used by wireless devices is typically set, by default, to the maximum, regardless of the distance between communicating nodes, and of the channel occupancy akella2007self . In decentralized environments, providing a model to obtain the optimal solution can be extremely complex. In WNs, the interactions between devices depend on many features (such as position, environment, transmit power) and are hard to derive. On top of this, network dynamics, in terms of user arrivals and departures, makes the resource allocation problem even harder.
In order to address the SR problem in decentralized WNs, we focus attention on Reinforcement Learning (RL), which has recently become very popular to solve many wellknown problems in wireless communications. Some examples can be found for packet routing littman1993distributed , Access Point (AP) selection bojovic2011supervised ; bojovic2012neural , optimal rate sampling combes2014optimal , or energy harvesting in heterogeneous networks miozzo2015distributed . Using online learning, a learner (or agent) obtains data in a sequential order and uses it to predict future goodperforming actions. Online learning is, therefore, useful to cope with complex and dynamic environments. This background encourages us to approach a solution for the decentralized SR problem in WNs through online learning techniques. In particular, we exploit dynamic channel selection and transmit power adjustment for dealing with the performance issues resulting from uncoordinated resource sharing in WNs. While a proper frequency planning allows to reduce the interference between wireless devices, tuning the transmit power adds an extra level of SR that can result in improved throughput and fairness.
From the family of online algorithms, we propose the utilization of MultiArmed Bandits (MABs) BCB12
, which is a wellknown model in the online learning literature for solving resource allocation problems. In MABs, a given agent seeks to learn a hidden reward distribution while maximizing the gains. This is known as the explorationexploitation tradeoff. While exploitation selects actions with the goal of maximizing longterm reward given the current estimate, exploration selects actions with the goal of improving the estimate. Unlike for classical RL, MABs do not consider states
^{1}^{1}1A state refers to a particular situation experienced by a given agent, which is defined by a set of conditions. By having an accurate knowledge on its current situation, an agent can define statespecific strategies that maximize its profits. in general, which can be hard to define for the decentralized SR problem. On the one hand, spatial interference cannot be binary treated, thus leading to complex interactions between nodes. On the other hand, the adversarial setting unleashed by decentralized deployments increases the system complexity, since the state not only depends on the actions taken by a given node, but also on the adversaries behavior.This work extends our previous results presented in wilhelmi2017implications , which were obtained by applying a stateless variation of Qlearning to the SR problem in WNs. Here we generalize the contributions done by implementing several actionselection strategies to allow WNs choose the channel and transmit power to use based on the resulting performance. In this paper, then, we provide a tutorialbased application of MABs to the decentralized SR problem, in order to maximize performance while observing the result from using different SR settings. According to that, we aim to illustrate the main benefits and drawbacks of applying different wellknown learning techniques in a selfish way, i.e., independent WNs base their strategy on their own experienced performance. On the one hand, we evaluate the impact of varying parameters intrinsic to the proposed algorithms on the resulting throughput and fairness. In addition, we analyze the effects of learning selfishly, in order to shed some light on the future of decentralized approaches. Notably, we observe that even though players act selfishly, there are algorithms learn to play actions that enhance the overall performance, some times at the cost of high temporal variability. Considering selfish WNs and still obtaining collaborative behaviors is appealing to typical chaotic and dynamic deployments as we get rid of the need for message passing, or inference of neighboring conditions.
The main contributions of this work are summarized below:

We devise the feasibility of applying MAB algorithms as defined in the online learning literature to solve the resource allocation problem in WNs.

We study the impact of different parameters intrinsic to the actionselection strategies considered (e.g., exploration coefficients, learning rates) on network performance.

We show that there are algorithms learn to play collaborative actions even though the WNs act selfishly, which is appealing to practical application in chaotic and dynamic environments. In addition, we shed some light on the root causes of this phenomena.
The remaining of this document is structured as follows: Section 2 outlines relevant related work. Section 3 introduces the proposed learning algorithms and their practical implementation for the resource allocation problem in WNs. Then, Section 4 presents the simulation scenarios and the considerations taken into account. The simulation results are later presented in Section 5. Finally, Section 6 provides some final remarks.
2 Related Work
In this paper we approach the decentralized SR problem through Dynamic Channel Allocation (DCA) and Transmit Power Control (TPC). While the former aims to allocate the available frequency range among potentially overlapping WNs, the latter attempts to increase the number of parallel transmissions by adjusting the power transmitted by each WN. While DCA operates at the frequency level, TPC deals with more complex spatial interactions. For that, an approach that dynamically tunes the transmit power may severely affect higher communication layers such as network (routing) or transport (congestion). TPC directly affects the transmission range, the data rate and the energy consumption, and it may create unidirectional links that unleash fairness issues. In fact, acknowledgments (ACKs) and RTS/CTS frames assume bidirectionality.
DCA has been extensively exploited from the centralized perspective, especially through techniques based on graph coloring riihijarvi2005frequency ; mishra2005weighted . Despite these kind of approaches allow to effectively reduce the interference between WNs, a certain degree of communication is required. Regarding decentralized methods, the authors in akl2007dynamic propose a very simple approach in which each AP maintains an interference map of their neighbors, so that channel assignment is done through interference minimization. Unfortunately, the interactions between APs due to decentralization are not studied. Separately, chen2007improved proposes two decentralized approaches that rely in the interference measured at both APs and STAs to calculate the best frequency channels for dynamic channel allocation. To do so, a WN, in addition to the interference sensed by its associated devices, considers other metrics such as the amount of traffic, so that some kind of coordination is required at the neighbor level (periodic reporting).
For the case of TPC, one of the first works in studying power management is elbatt2000power , which shows that large improvements can be achieved if selecting the appropriate transmit power level in adhoc WNs. A centralized approach for TPC is presented in tang2014joint , which performs power control and Rate Adaptation (RA) in subgroups of Wireless Local Area Networks (WLANs). The creation of clusters allows defining independent power levels between devices in the same group, which are useful to avoid asymmetric links. However, to represent all the possible combinations, graphs can become very large, specially in highdensity deployments. Furthermore, we find a decentralized approach that relies on realtime channel measurements gandarillas2014dynamic . The proposed mechanism (so called Dynamic Transmission Power Control) is based on a set of triggered thresholds that increase/decrease the transmit power according to the situation. The main problem is that thresholds are set empirically (based on simulations), so the mechanism may not always improve performance.
2.1 Solving WNs Resource Allocation Problems via Bandits
In the online learning literature, several MAB settings have been considered such as stochastic bandits thompson1933likelihood ; lai1985asymptotically ; auer2002finite , adversarial bandits auer1995gambling ; auer2002nonstochastic , restless bandits whittle1988restless , contextual bandits LCLS10 and linear bandits abe2003reinforcement ; APS11 , and numerous explorationexploitation strategies have been proposed such as greedy sutton1998reinforcement ; auer2002finite , upper confidence bound (UCB) lai1985asymptotically ; Agr95 ; BuKa96 ; auer2002finite , exponential weight algorithm for exploration and exploitation (EXP3) auer1995gambling ; auer2002finite and Thompson sampling thompson1933likelihood . The classical multiarmed bandit problem models a sequential interaction scheme between a learner and an environment. The learner sequentially selects one out of actions (often called arms in this context) and earns some rewards determined by the chosen action and also influenced by the environment. Formally, the problem is defined as a repeated game where the following steps are repeated in each round :

The environment fixes an assignment of rewards for each action ,

the learner chooses action ,

the learner obtains and observes reward .
The bandit literature largely focuses on the perspective of the learner with the objective of coming up with learning algorithms that attempt to maximize the sum of the rewards gathered during the whole procedure (either with finite or infinite horizon). As noted above, this problem has been studied under various assumptions made on the environment and the structure of the arms. The most important basic cases are the stochastic bandit problem where, for each particular arm
, the rewards are i.i.d. realizations of random variables from a fixed (but unknown) distribution
, and the nonstochastic (or adversarial) bandit problem where the rewards are chosen arbitrarily by the environment. In both cases, the main challenge for the learner is the partial observability of the rewards: the learner only gets to observe the reward associated with the chosen action , but never observes the rewards realized for the other actions.Let and be the rewards obtained at time from choosing actions (optimal) and , respectively. Then, the performance of learning algorithms is typically measured by the total expected regret defined as
An algorithm is said to learn if it guarantees that the regret grows sublinearly in , that is, if is guaranteed as grows large, or, equivalently, that the average regret converges to zero. Intuitively, sublinear regret means that the learner eventually identifies the action with the highest longterm payoff. Note, as well, that the optimal action is the same across all the rounds. Most bandit algorithms come with some sort of a guaranteed upper bound on which allows for a principled comparison between various methods.
To the best of our knowledge, there is very little related work on applying multiarmed bandit techniques to the problem of resource allocation in WNs. In coucheney2015multi , the authors propose modeling a resource allocation problem in LTE networks through MABs. In particular, a set of Base Stations (BS) learn the best configuration of Resource Blocks (RBs) in a decentralized way. For that purpose, a variation of EXP3 (socalled QEXP3) is proposed, which allows to reduce the strategy set. Despite a regret bound is provided, it is subject to the fact that an optimal resource allocation exists, i.e., every BS obtains the necessary resources. In addition, a large number of iterations is required to find the optimal solution in a relatively small scenario, thus revealing the difficulties shown by decentralized settings.
More related to the problem proposed here, the authors in maghsudi2015joint show a channel selection and power control approach in infrastructureless networks, which is modeled through bandits. In particular, two different strategies are provided to improve the performance of two Device to Device (D2D) users (each one composed by a transmitter and a receiver), which must learn the best channel and transmit power to be selected. Similarly to our problem, users do not have any knowledge on the channel or the other’s configuration, so they rely on the experienced performance in order to find the best configuration. An extension of maghsudi2015joint is provided by the same authors in maghsudi2015channel , which includes a calibrated predictor (referred in the work as forecaster
) to infer the behavior of the other devices in order to counter act their actions. In each agent, the information of the forecaster is used to choose the highestrewarding action with a certain probability, while the rest of actions are randomly selected. Henceforth, assuming that all the networks use a given strategy
, fast convergence is ensured. Results show that channel resources are optimally distributed in a very short time frame through a fully decentralized algorithm that does not require any kind of coordination. Both aforementioned works rely in the existence of a unique Nash Equilibrium, which favors convergence. In contrast, in this article we aim to extend Bandits utilization to denser deployments, and, what is more important, to scenarios with limited available resources in which there is not a unique Nash Equilibrium (NE) that allows fastconvergence. Thus, we aim to capture the effects of applying selfish strategies in a decentralized way (i.e., agent follows a strategy that does not consider the strategies of the others) and we also provide insight about the importance of past information for learning in dense WNs, which has not been studied before.3 MultiArmed Bandits for Improving Spatial Reuse in WNs
The decentralized resource allocation problem in WNs we aim to solve in this work can be modeled as an approximation of contextual bandits. Originally, the contextual bandits problem considers the existence of a context (or a state) that influences the actionselection process. As a consequence, the set of available strategies varies with the context and the probability distribution of a given reward depends both on the chosen actions and also on the context. In the resource allocation problem at hand, the context could include information such as the number of neighboring WNs, their position, their current channel selection, the transmission power they use as well as their actionselection strategies, among other. However, we are interested in the case where no context can be inferred given that such information is hard and costly to obtain in practice, especially in dynamic scenarios. Thus, we model the decentralized SR problem in WNs as an adversarial bandit problem, in which the actions of a given WN affect the payoff distributions of the others’ actions. Such model requires the existence of a NE with a pure strategy,
^{2}^{2}2A pure strategy NE is conformed by a set of strategies and payoffs, so that no player can obtain further benefits by deviating from its strategy. which does not always hold for unplanned deployments due to the scarcity of the available resources and the number of overlapping WNs. Understanding the implications derived from such an adversarial setting in the absence of a NE is one the main goals of this paper, which, to the best of our knowledge, has been barely considered in the previous literature.We model this adversarial problem as follows. Let arm (we denote the size of with K) be a configuration that a WN may choose. Each configuration is a combination of channel and transmit power (e.g., = {Channel: 1, TPC: 5 dBm}), and grants a reward that depends on the others’ configuration. Let be the throughput experienced by at time , and the maximum achievable throughput by in case of isolation (i.e., no interference is experienced in the selected channel). We then define the reward experienced by at time as:
For simplicity, we consider that each WN is composed by an AP and a single STA, so that downlink traffic is held. Note that in typical uncoordinated wireless deployments (e.g., residential buildings), STAs are typically close to the AP to which they are associated. Thus, having several STAs associated to the same AP does not significantly impact the interWNs interference we study in this work. According to that, is computed as , where is the bandwidth, and is the SignaltoNoise ratio experienced by the STA of . We consider this model because it allows to properly capture the interference generated between WNs, which is sought to be minimized.
We have considered the greedy, EXP3, UCB and Thompson sampling actionselection strategies, which are described next in this section. While greedy and EXP3 explicitly include the concepts of exploration coefficient and learning rate, respectively, UCB and Thompson sampling are parameterfree policies that extend the concept of exploration (actions are explored according to their estimated value and not by commitment). Henceforth, we are able to study the effects of these different kind of policies on the WNs behavior. Moreover, we chose the aforementioned policies because they are widely spread and considered of remarkable importance in the MABs literature.
Furthermore, in order to provide a fair comparison, we have considered that all the policies use the same reward. Regarding the learning process, we assume that WNs select actions during the same time slot. The reward experienced by the WNs is computed once all of them have chosen their action for the current iteration. In practice the actionselection process at WNs will most probably not be synchronized but we leave the study of any possible effects of that desynchronization to future work.
3.1 greedy
The greedy policy sutton1998reinforcement ; auer2002finite is arguably the simplest learning algorithm attempting to deal with explorationexploitation tradeoffs. In each round , the greedy algorithm explicitly decides whether to explore or exploit: with probability , the algorithm picks an arm uniformly at random (exploration), and otherwise it plays the arm with the highest empirical return (exploitation).
In case is fixed for the entire process, the expected regret is obviously going to grow linearly as in general. Therefore, in order to obtain a sublinear regret guarantee (and thus an asymptotically optimal growth rate for the total rewards), it is critical to properly adjust the exploration coefficient. Thus, in our implementation of the greedy algorithm, we use a timedependent exploration rate of , as suggested in the literature auer2002finite . The adaptation of this policy to our setting is shown as Algorithm 1.
3.2 Exp3
The EXP3 algorithm auer1995gambling ; auer2002nonstochastic is an adaptation of the weighted majority algorithm of LW94 ; FS97 to the nonstochastic bandit problem. EXP3 maintains a set of nonnegative weights assigned to each arm and picks the actions randomly with a probability proportional to their respective weights (initialized to 1 for all arms). The aim of EXP3 is to provide higher weights to the best actions as the learning procedure proceeds.
More formally, letting be the weight of arm at time , EXP3 computes the probability of choosing arm in round as
where is a parameter controlling the rate of exploration. Having selected arm , the learner observes the generated payoff and computes the importanceweighted reward estimates
for all , where denoting the indicator function of the event taking a value of if is true and otherwise. Finally, the weight of arm is updated as a function of the estimated reward:
where is a parameter of the algorithm often called the learning rate. Intuitively, regulates the rate in which the algorithm incorporates new observations with large values of corresponding to more confident updates and smaller values leading to more conservative behavior. As we did for the exploration coefficient in greedy, we use a timedependent learning rate of auer2002finite . Our implementation of EXP3 to the WN problem is detailed in Algorithm 2.
3.3 Ucb
The upper confidence bound (UCB) actionselection strategy Agr95 ; BuKa96 ; auer2002finite is based on the principle of optimism in face of uncertainty
: in each round, UCB selects the arm with the highest statistically feasible mean reward given the past observations. Statistical feasibility here is represented by an upper confidence bound on the mean rewards which shrinks around the empirical rewards as the number of observations increases. Intuitively, UCB trades off exploration and exploitation very effectively, as upon every time a suboptimal arm is chosen, the corresponding confidence bound will shrink significantly, thus quickly decreasing the probability of drawing this arm in the future. The width of the confidence intervals is chosen carefully so that the true best arm never gets discarded accidentally by the algorithm, yet suboptimal arms are not drawn as few times as possible. To obtain the first estimates, each arm is played once at the initialization.
Formally, let be the number of times that arm has been played, and the throughput obtained by playing arm at time . The average experienced reward by arm at time is therefore given by:
Based on these average rewards, UCB selects the action that maximizes . By doing so, UCB implicitly balances exploration and exploitation, as it focuses efforts on the arms that are the most promising (with large estimated rewards) or not explored enough (with small ). Our implementation of UCB to the WN problem is detailed in Algorithm 3.
3.4 Thompson sampling
Thompson sampling thompson1933likelihood is a wellstudied actionselection technique that had been known for its excellent empirical performance CL11 and was recently proven to achieve strong performance guarantees, often better than those warranted by UCB AG12 ; KKM12 ; KKM13 . Thompson sampling is a Bayesian algorithm: it constructs a probabilistic model of the rewards and assumes a prior distribution of the parameters of said model. Given the data collected during the learning procedure, this policy keeps track of the posterior distribution of the rewards, and pulls arms randomly in a way that the drawing probability of each arm matches the probability of the particular arm being optimal. In practice, this is implemented by sampling the parameter corresponding to each arm from the posterior distribution, and pulling the arm yielding the maximal expected reward under the sampled parameter value.
For the sake of practicality, we aim to apply Thompson sampling using a Gaussian model for the rewards with a standard Gaussian prior as suggested in agrawal2013further . By standard calculations, it can be verified that the posterior distribution of the rewards under this model is Gaussian with mean
and variance
, where is the number of times that arm was drawn until the beginning of round . Thus, implementing Thompson sampling in this model amounts to sampling a parameterfrom the Gaussian distribution
and choosing the action with the maximal parameter. Our implementation of Thompson sampling to the WN problem is detailed in Algorithm 4.4 System model
For the remainder of this work, we study the interactions between several WNs placed in a 3D scenario that occur when applying MABs in a decentralized manner (with parameters described later in Section 4.3). For simplicity, we consider WNs to be composed by an Access Point (AP) transmitting to a single Station (STA) in a downlink manner. Additionally, to analyze the adversarial setting when WNs share scarce resources, we consider that the number of channels is equal to half the number of coexisting WNs, thus making the SR even more challenging.
4.1 Channel modeling
Pathloss and shadowing effects are modeled using the logdistance model for indoor communications. The pathloss between WN and WN is given by:
where is the transmitted power in dBm by the AP in , is the pathloss exponent, is the power in dBm received at the STA in , is the pathloss at one meter in dB, is the distance between the transmitter and the receiver in meters, is the lognormal shadowing loss in dB, and is the obstacles loss in dB. Note that we include the factor , which is the average distance between two obstacles in meters.
4.2 Throughput calculation
As previously mentioned, we use the power received and the interference to calculate the maximum theoretical throughput of each WN at time by using the Shannon Capacity,
where is the channel width and the experienced Signal to Interference plus Noise Ratio (SINR) is given by:
where and are the received power and the sum of the interference at WN at time , respectively, and N is the floor noise power. For each WN, we consider the interference to be the total power received from all the APs in the same channel. For capturing the worstcase scenario, we consider that all WNs are continuously transmitting (i.e., we do not consider empty channel periods left by the medium access procedure). Adjacent channel interference is also considered in , so that the transmitted power leaked to adjacent channels is dBm lower for each extra channel separation.
4.3 Simulation Parameters
According to bellalta2016ax , which provides an overview of the IEEE 802.11ax2019 standard, a typical highdensity scenario for residential buildings contains (i.e., 100 APs in a m area). Accordingly, for simulation purposes, we define a map scenario with dimensions m, containing from 2 to 8 APs. In addition, for the first part of the simulations, we consider a setting containing 4 WNs that form a grid topology. In it, STAs are placed at the maximum possible distance from the other networks.
Table 1 details the parameters used.
Parameter  Value 

Map size (m)  
Number of coexistent WNs  {2, 4, 6, 8} 
APs/STAs per WN  1 / 1 
Distance APSTA (m)  
Number of Channels  2 
Channel Bandwidth (MHz)  20 
Initial channel selection model  Uniformly distributed 
TPC Values (dBm)  {5, 10, 15, 20} 
(dB)  5 
(dB)  Normally distributed with mean 9.5 
(dB)  Uniformly distributed with mean 30 
(meters between two obstacles)  5 
Noise level (dBm)  100 
Traffic model  Full buffer (downlink) 
Number of learning iterations  10,000 
5 Performance Evaluation
In this Section, we evaluate the performance of each actionselection strategy presented in Section 3 when applied to the decentralized SR problem in WNs.^{3}^{3}3All of the source code used in this work open fwilhelmi2017code , encouraging sharing of algorithms between contributors and providing the ability for people to improve on the work of others under the GNU General Public License v3.0 For that purpose, we first evaluate the greedy, EXP3, UCB and Thompson sampling policies in a controlled environment. Without loss of generality, we consider a symmetric configuration in this scenario in order to analyze the competition effects when WNs have the same opportunities for accessing the channel. Then, in Section 5.2, we provide a performance comparison of the aforementioned scenarios with different densities and with randomly located WNs.
5.1 Toy Grid Scenario
We start considering a toy scenario consisting in a grid of 4 WNs, as illustrated in Figure 1. This scenario has the particularity of being symmetric, so that adversarial WNs have the same opportunities to compete for the channel resources. Henceforth, this scenario allows us to i) simplify the evaluation of the proposed algorithms, ii) frame the competition between WNs on equal terms.
5.1.1 Optimal Solution
Before presenting the simulation results of applying a particular actionselection policy, we introduce the optimal solutions that: i) maximize the aggregate throughput and ii) correspond to proportional fairness^{4}^{4}4The proportional fairness solutions satisfy . for the considered grid scenario, shown in Figure 1.
Let actions range from to , and map to {channel number, transmit power (dBm)}: = {1, 5}, = {2, 5}, = {1, 10}, = {2, 10}, = {1, 15}, = {2, 15}, = {1, 20} and = {2, 20}. According to our model, the joint configuration of the 4 WNs that grants the maximum aggregate throughput is any combination of actions {, , }, subject to the condition of two close WNs using . Despite the scenario is symmetric, the distance between APs is not the same in all the cases. For instance, . It is important to emphasize that the proposed scenario does not consider asymmetries between WNs, so that dominance relationships are not present. In case of suffering from such asymmetries, the performance of a given WN would depend on the configuration of its neighboring networks but it won’t be able to affect their behavior. For instance, a WN in the middle of two others may suffer from flow starvation if the WNs in the extremes do not sense each other. In such a situation, the WN in the middle is not able to fairly compete for channel resources due to the asymmetric situation it suffers.
In our symmetric scenario, the optimal aggregate throughput is 1124 Mbps, and is only achieved when two consecutive WNs sacrifice themselves by choosing a lower transmit power (action ). Moreover, to maximize the proportional fairness, the optimal joint configuration changes to any combination using the maximum transmit power (i.e., and ), so that frequency reuse is maximized; WNs using the same channel must be as much separated as possible. Such a configuration leads to an aggregate throughput of 891 Mbps, which is quite lower than in the former case, but allows experiencing a fairer configuration. Note that all WNs obtain the same throughput in this case as frequency reuse is achieved (closer WNs choose different channels), but power is kept equal for each WN and despite experiencing higher interference, none of the WNs sacrifice themselves. The aim of the following performance evaluation is to gain insight into whether a collaborative result, close to this equal division of resources given by the proportional fair action selection, is likely to be learned by the algorithms and if so, under which conditions.
5.1.2 Performance of the greedy policy
We first evaluate the performance of the greedy policy when applied to the WNs problem, which allows us to study the purest form of the exploitationexploration tradeoff. First, we show the impact of modifying the initial exploration coefficient, (recall that we have considered ), on the average throughput experienced by the overlapping WNs. Figure 2 shows the aggregate throughput obtained in the grid scenario when applying greedy during 10,000 iterations, and for each
value between 0 and 1 in 0.1 steps. The average and standard deviation of the throughput from 100 simulation runs are shown, which are compared with the optimal throughput presented in Section
5.1.1.As shown in the figure, the aggregate throughput obtained in average is quite similar for all the provided values of , except for the complete random case where no exploration is done (). The lower the parameter, the less exploration is performed. Consequently, for low , the average throughput is highly dependent on how good/bad were the actions taken at the beginning of the learning process, which results in a higher standard deviation as goes to 0.
Now, in order to provide a deeper analysis on the WNs behavior when implementing the greedy policy, we focus on the actions probability and the temporal evolution of throughput derived from a single simulation run. To this aim, we use , which correspond to three representative cases (low exploration, middle degree of exploration and intensive exploration). Figure 3 shows the action probability distribution at each WN at the end of the 10,000 iterationlong simulation, and for each of the selected values.
(a) = 0.1 
(b) = 0.5 
(c) = 1 
We observe in this figure that more aggressive actions (i.e., the ones that use the maximum transmit power) are chosen with a higher rate in all the WNs as increases. A higher degree of exploration allows WNs to observe the effects of the actions taken by the others, thus acting in consequence. Accordingly, WNs alternate good/poor performance depending on the actions of the others, which at the same time are also selected intermittently. Despite using a high transmit power allows obtaining a better SNR at the receivers, the generated interference at times becomes counterproductive in terms of individual and aggregate throughput.
On the other hand, as decreases (refer to Figure 3(a)), WNs are prone to exploit the same action more frequently, which are the ones that provide the highest known performance. This low degree of exploration results also in a smaller usage of suboptimal actions.
Figure 4 shows the aggregate throughput evolution, which suffers from higher temporal variability as increases. The same observation can be done from the individual throughput evolution (shown in Figure 5), which in addition shows larger differences between the individual patterns for lower values. These differences correspond to unfair performance resulting from a low degree of exploration.
(a) = 0.1 
(b) = 0.5 
(c) = 1 
(a) = 0.1 
(b) = 0.5 
(c) = 1 
To gain further insight into fairness, we depict in Figure 6 the average and standard deviation of the throughput at each WN. We have now concentrated in the last 5,000 iterations of the simulation, disregarding the instability provoked during the transitory regime, so that we get a better picture of the longterm behavior. We see that as increases, the experienced fairness becomes higher due to the temporal fluctuation that occurs as a consequence of performing more exploratory actions. Moreover, we also observe that the standard deviation of the throughput is similar at each WN.
(a) = 0.1 
(b) = 0.5 
(c) = 1 
To sum up, we have seen that tuning clearly entails a tradeoff between the fairness and the temporal variation of throughput. Choosing a low value grants low variability, but WNs are prone to fall into local maximums, as the experienced performance is just a consequence of the randomness in taking particular actions. This also generates a fairness problem, since the unlucky WNs end up experiencing lower performance, as well as they are not able to discover the best throughput actions. On the other side, choosing a high value results in higher throughput and better longterm fairness but at the expense of intermittent good/poor performance.
5.1.3 Performance of the EXP3 policy
We now evaluate the tradeoff between fairness and temporal throughput variability in the EXP3 policy, which includes the learning rate , a parameter that controls how fast old beliefs are replaced by newer ones. In EXP3 we also find the parameter, which regulates explicit exploration by tuning the importance of weights in the actionselection procedure. Setting results in completely neglecting weights (actions have the same probability to be chosen). On the other side, setting , the effect of weights are at its highest. Thus, in order to see clearly the effects of the EXP3 weights, which directly depend on , we fix to 0.
As we did before for greedy, we start analyzing the impact of modifying the input parameters of EXP3 on the average aggregate throughput (Figure 7). We vary the initial learning rate, , between 0 and 1 in 0.1 steps. As it can be observed, setting to 0 results in the lowest average throughput, and a low variability is obtained. In this case, action weights are never updated and the arms are randomly chosen with the same probability. As increases, the proportional fair solution is approximated at the expense of a higher variability between experiments, since less exploration is performed when increases. Note, as well, that the maximum average network throughput is reached for in this particular scenario.
Now, for a single 10,000iteration simulation, we set to closely study the implications of modifying the initial learning rate. Figure 8 compares the resulting actions probability for these low, medium and high values.
(a) 
(b) 
(c) 
We see that actions are tried with similar probability for , since the usage of a low learning rate entails a high similarity between weights. This situation is significantly improved for , in which few actions are selected most of the times. The increased learning rate allows to rapidly capture the effects of the others’ actions, since more exploration is carried out. As a consequence, actions using the highest transmit power are clearly preferred to overcome the adversarial setting. Note, as well, that optimal channel allocation is learned by WNs despite competing with each other. Finally, for , a single action is clearly preferred by each WN. However, unfair situations eventually occur due to the lack of exploration. In particular, we observe that and , which share the same channel, use very different power levels in their favorite actions (5 and 15 dBm, respectively). This, of course, generates a fairness imbalance that favors , which enjoys a higher SINR. As mentioned, this behavior occurs eventually, since the variability in the output can be understood as a consequence of a fast convergence achieved during the transitory learning phase. So, results from different simulations will significantly vary.
Figure 9 shows the temporal aggregate throughput, which suffers from a higher variability when is lower. Regarding individual performance, the same behavior can be observed in the temporal throughput evolution (Figure 10). In this case, setting implies trusting bestperforming actions for larger periods, i.e., higher weights are assigned even if actions are suboptimal. Therefore, as decreases, the temporal variability increases because the actionselection process becomes uniformly random. However, this variability does not entail that the proportional fair solution is approached. Thus, must be carefully set in order to provide a low temporal variability, at the same time that enough exploration is carried out for reaching the optimal fairness solution.
(a) 
(b) 
(c) 
(a) 
(b) 
(c) 
One can notice a fairness imbalance by observing the individual throughput for in Figure 10, which is further illustrated in Figure 11 (again, we have considered the last 5,000 iterations to avoid capturing the transitory phase). The average throughput is more evenly distributed for and , but at the expense of the experienced throughput variability seen above. The unfairness suffered for can be understood as a lack of exploration that prevents finding a fair resource distribution. Thus, we conclude that a similar tradeoff experienced in greedy appears in EXP3 with parameter . The main difference is that, in this case, a random actionselection process is approximated as decreases.
(a) 
(b) 
(c) 
5.1.4 Performance of the UCB policy
Unlike greedy and EXP3, the basic form of UCB does not include any input parameter to regulate variables such as the learning rate or the exploration coefficient. On the contrary, the actionselection process relies in the generated rewards only. We will study in this section how the previously seen tradeoff between fair throughput and variability results for this parameterfree policy.
We start by showing the action probability distribution (shown in Figure 12) derived from the UCB implementation. We observe that WNs clearly prefer playing the actions corresponding to proportional fairness (i.e., the actions with highest transmit power and that result in frequency reuse).
Now, we concentrate in the temporal aggregate throughput (shown in Figure 12(a)). The results show a very high variability, which slowly decreases as the number of iterations increases. The same observations can be done in the individual throughput evolution (Figure 12(b)). As a result, UCB provides longterm fairness but at the expense of a high temporal variability caused by exploration. In Figure 12(c) we show the average throughput experienced by each WN, as well as the standard deviation indicating the suffered variability (only for the last 5,000 iterations). As seen before, one can observe really high fairness but considerable variability due to the temporal evolution of throughput.
Compared to greedy and EXP3, UCB shows a clearer preference for selecting the proportional fair solutions. However, the intermittent good/poor performance of the actions due to the adversarial setting still keeps the degree of exploration very high, resulting in high temporal variability.
5.1.5 Performance of the Thompson sampling policy
Similarly to UCB, in the basic form of Thompson sampling we do not have any tunable parameter. As with the other policies, we use a single 10,000iteration simulation to show in detail the WNs behavior when applying Thompson sampling. We first show the actions probability in Figure 14. Note the wellpronounced preference for the proportional fair actions, which indicates fast convergence and little exploratory operations.
The temporal aggregate throughput is shown in Figure 15. As it can be observed, Thompson sampling achieves a lower variability in the temporal throughput with respect to UCB. The same observations can be done in the individual throughput evolution (Figure 14(b)). The low temporal variability observed on applying Thompson sampling is highlighted in Figure 14(c), which shows that the average experienced throughput enjoys a low standard deviation. Moreover, high fairness among WNs can be observed.
These results show that even though the rewards encourage selfish behavior, the adversarial setting makes Thompson sampling to converge to collaborative actions while keeping the degree of exploration low. Again, this property holds for scenarios where no asymmetries can be found. While the excellent performance we observe echoes other published results that show the superiority of Thompson sampling in various settings LCLS10 , we highlight that this outstanding empirical performance is actually rather unexpected in our nonstochastic setting. Indeed, existing theoretical results only explain the excellent performance of Thompson sampling in stochastic bandit problems. On the other hand, the randomization employed by Thompson sampling is at least superficially related to techniques used in gametheoretic online learning FS97 ; CBLu06:book , with the connection made clear only in a special case Gop13 . The behavior observed in our experiments may indicate a more profound relationship with such gametheoretic algorithms. We leave the investigation of this matter as an exciting direction for future work.
5.2 Random Scenarios
After studying the WNs interactions in the grid scenario, we now evaluate whether the previous conclusions generalize to random scenarios with an arbitrary number of WNs. To this aim, we use the same m scenario and randomly allocate WNs.
We set in greedy, which is the value that granted the highest aggregate throughput in the experiment shown in Figure 2. We do the same for EXP3 and choose , which grants the highest aggregate throughput in average (refer to Figure 7). In Figure 16 we show, for each number of networks and actionselection strategy, an histogram of the mean throughput experienced by each WN during the last 5,000 iterations, and for each of the 100 repetitions.
(a) N = 2 WNs 
(b) N = 4 WNs 
(c) N = 6 WNs 
(c) N = 8 WNs 
As we can observe, histograms of UCB and Thompson sampling are slightly narrower in almost all the cases, which indicates a higher fairness experienced by the WNs with respect to the other policies. Note, as well, that for the , results are similar for all the policies because finding a goodperforming configuration in lowdensity environments is more likely to occur. Thus, fewer exploration is required than for higher densities. Note, as well, that the abovementioned collaborative behavior is still present in random scenarios. In accordance to that, WNs acting selfishly are prone to find a configuration whereby SR is improved in a fair manner. We find the main cause of this issue to be the interWN interference sensed at a given transmitterreceiver pair, which is very likely to be similar in both devices for the proposed scenarios. Such a property is given when STAs are close to their associated AP, thus allowing to palliate the effects of asymmetries.
To conclude, we evaluate temporal variability in random scenarios. For that, Table 2 shows the average standard deviation of the throughput experienced by each WN during the last 5,000 iterations (again, we concentrate on the regime after the initial learning phase). We consider the average results from 100 repetitions. With this, we are able to evaluate, for each network density, the variability in the temporal throughput provided by each of the policies in random topologies.

(Mbps)  

greedy  EXP3  UCB  Thompson s.  
2  12.1314  31.2897  31.9422  12.2079  
4  83.7318  95.4976  73.9136  50.6985  
6  120.8570  81.3259  67.6045  62.8171  
8  130.0322  73.5171  63.2272  68.8547 
As previously seen in the toy grid scenario introduced in Section 5.1, both UCB and Thompson sampling provide the best results in terms of temporal throughput variability. Despite greedy provides very good results for , it does not properly scale up with the number of WNs.
6 Conclusions
In this paper, we provided a tutoriallike implementation of MABs to address the decentralized SR problem in dense WNs. Unlike previous literature, we have focused on a situation in which few resources are available, thus bringing out the competition issues raised from the adversarial setting. Our results show that decentralized learning allows improving SR in dense WN, so that collaborative results in symmetric scenarios, sometimes close to optimal proportional fairness, can be achieved. This result is achieved even though WNs act selfishly, aiming to maximize their own throughput. In addition, this behavior is observed for random scenarios, where the effects of asymmetries cannot be controlled. These collaborative actions are, at times, accompanied by high temporal throughput variability, which can be understood as a consequence of the rate at which networks change their configuration in response of the opponents behavior. A high temporal variability may provoke negative issues in a node’s performance, as its effects may be propagated to higher layers of the protocol stack. For instance, a high throughput fluctuation may entail behavioral anomalies in protocols such as Transmission Control Protocol (TCP). We have studied this tradeoff between fair resource allocation and high temporal throughput variability in greedy, EXP3, UCB and Thompson sampling actionselection strategies. Our results show that while this tradeoff is hard to regulate via the learning parameters in greedy and EXP3, UCB and, especially, Thompson sampling are able to achieve fairness at a reduced temporal variability. We identify the root cause of this phenomena to the fact that both UCB and Thompson sampling consider the probability distribution of the rewards, and not only their magnitude.
We left as future work to further study the MABs application to WNs through distributed (with message passing) and centralized (with complete information) approaches with shared reward. Furthermore, we would like to extend this work to enhance both throughput and stability by inferring the actions of the opponents and acting in consequence. Defining the resource allocation problem as an adversarial game is one possibility to do so. Finally, we intend to particularize to IEEE 802.11 WLANs to study the effects of applying RL when using a medium access protocol such as CSMA. In fact, this would let us consider dynamic Carrier Sense Threshold (CST) to further enable SR.
Acknowledgments
This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM20150502), by the European Regional Development Fund under grant TEC201571303R (MINECO/FEDER), and by a Gift from CISCO University Research Program (CG#890107) & Silicon Valley Community Foundation.
References
 [1] Aditya Akella, Glenn Judd, Srinivasan Seshan, and Peter Steenkiste. Selfmanagement in chaotic wireless deployments. Wireless Networks, 13(6):737–755, 2007.

[2]
Michael Littman and Justin Boyan.
A distributed reinforcement learning scheme for network routing.
In
Proceedings of the international workshop on applications of neural networks to telecommunications
, pages 45–51. Psychology Press, 1993. 
[3]
Biljana Bojovic, Nicola Baldo, Jaume NinGuerrero, and Paolo Dini.
A supervised learning approach to cognitive access point selection.
In GLOBECOM Workshops (GC Wkshps), 2011 IEEE, pages 1100–1105. IEEE, 2011.  [4] Biljana Bojovic, Nicola Baldo, and Paolo Dini. A neural network based cognitive engine for ieee 802.11 WLAN access point selection. In Consumer Communications and Networking Conference (CCNC), 2012 IEEE, pages 864–868. IEEE, 2012.
 [5] Richard Combes, Alexandre Proutiere, Donggyu Yun, Jungseul Ok, and Yung Yi. Optimal rate sampling in 802.11 systems. In INFOCOM, 2014 Proceedings IEEE, pages 2760–2767. IEEE, 2014.
 [6] Marco Miozzo, Lorenza Giupponi, Michele Rossi, and Paolo Dini. Distributed qlearning for energy harvesting heterogeneous networks. In Communication Workshop (ICCW), 2015 IEEE International Conference on, pages 2006–2011. IEEE, 2015.
 [7] Sébastien Bubeck and Nicoló CesaBianchi. Regret analysis of stochastic and nonstochastic multiarmed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
 [8] Francesc Wilhelmi, Boris Bellalta, Cristina Cano, and Anders Jonsson. Implications of decentralized qlearning resource allocation in wireless networks. arXiv preprint arXiv:1705.10508, 2017.
 [9] Janne Riihijarvi, Marina Petrova, and Petri Mahonen. Frequency allocation for WLANs using graph colouring techniques. In Wireless Ondemand Network Systems and Services, 2005. WONS 2005. Second Annual Conference on, pages 216–222. IEEE, 2005.
 [10] Arunesh Mishra, Suman Banerjee, and William Arbaugh. Weighted coloring based channel assignment for WLANs. ACM SIGMOBILE Mobile Computing and Communications Review, 9(3):19–31, 2005.
 [11] Robert Akl and Anurag Arepally. Dynamic channel assignment in IEEE 802.11 networks. In Portable Information Devices, 2007. PORTABLE07. IEEE International Conference on, pages 1–5. IEEE, 2007.
 [12] Jeremy K Chen, Gustavo De Veciana, and Theodore S Rappaport. Improved measurementbased frequency allocation algorithms for wireless networks. In Global Telecommunications Conference, 2007. GLOBECOM’07. IEEE, pages 4790–4795. IEEE, 2007.
 [13] Tamer A ElBatt, Srikanth V Krishnamurthy, Dennis Connors, and Son Dao. Power management for throughput enhancement in wireless adhoc networks. In Communications, 2000. ICC 2000. 2000 IEEE International Conference on, volume 3, pages 1506–1513. IEEE, 2000.
 [14] Suhua Tang, Hiroyuki Yomo, Akio Hasegawa, Tatsuo Shibata, and Masayoshi Ohashi. Joint transmit power control and rate adaptation for wireless LANs. Wireless personal communications, 74(2):469–486, 2014.
 [15] Carlos Gandarillas, Carlos MartínEngeños, Héctor López Pombo, and Antonio G Marques. Dynamic transmitpower control for WiFi access points based on wireless link occupancy. In Wireless Communications and Networking Conference (WCNC), 2014 IEEE, pages 1093–1098. IEEE, 2014.
 [16] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
 [17] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
 [18] Peter Auer, Nicoló CesaBianchi, and Paul Fischer. Finitetime analysis of the multiarmed bandit problem. Machine learning, 47(23):235–256, 2002.
 [19] Peter Auer, Nicoló CesaBianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multiarmed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 322–331. IEEE, 1995.
 [20] Peter Auer, Nicoló CesaBianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
 [21] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of applied probability, 25(A):287–298, 1988.
 [22] L. Li, W. Chu, J. Langford, and R.E. Schapire. A contextualbandit approach to personalized news article recommendation. In WWW, pages 661–670, 2010.
 [23] Naoki Abe, Alan W Biermann, and Philip M Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
 [24] Y. AbbasiYadkori, D. Pal, and C. Szepesvari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), 2011.
 [25] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 [26] R. Agrawal. Sample mean based index policies with o(log n) regret for the multiarmed bandit problem. Advances in Applied Mathematics, 27:1054–1078, 1995.
 [27] A.N. Burnetas and M.N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17:122–142, 1996.
 [28] Pierre Coucheney, Kinda Khawam, and Johanne Cohen. Multiarmed bandit for distributed intercell interference coordination. In ICC, pages 3323–3328, 2015.
 [29] Setareh Maghsudi and Sławomir Stańczak. Joint channel selection and power control in infrastructureless wireless networks: A multiplayer multiarmed bandit framework. IEEE Transactions on Vehicular Technology, 64(10):4565–4578, 2015.
 [30] Setareh Maghsudi and Sławomir Stańczak. Channel selection for networkassisted D2D communication via noregret bandit learning with calibrated forecasting. IEEE Transactions on Wireless Communications, 14(3):1309–1322, 2015.
 [31] N. Littlestone and M. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
 [32] Y. Freund and R. E. Schapire. A decisiontheoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
 [33] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In J. ShaweTaylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011.
 [34] Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multiarmed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
 [35] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite time analysis. In Algorithmic Learning Theory, pages 199–213. Springer, 2012.
 [36] Nathaniel Korda, Emilie Kaufmann, and Remi Munos. Thompson sampling for 1dimensional exponential family bandits. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1448–1456. Curran Associates, Inc., 2013.
 [37] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.
 [38] Boris Bellalta. IEEE 802.11 ax: Highefficiency WLANs. IEEE Wireless Communications, 23(1):38–46, 2016.

[39]
Francesc Wilhelmi.
Collaborative spatial reuse in wireless networks via selfih
mulitarmed bandits.
https://github.com/fwilhelmi/collaborative_sr_in_wns_via_selfish_mabs
Commit: 1849597, 2017.  [40] N. CesaBianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
 [41] Aditya Gopalan. Thompson sampling for online learning with linear experts. arXiv preprint arXiv:1311.0468, 2013.
Comments
There are no comments yet.