## Abstract

Network meta-analysis (NMA) as the quantification of pairwise meta-analysis in a network format has been of particular interest to medical researchers in recent years. As a powerful tool with which direct and indirect evidence from multiple interventions can be synthesized simultaneously in the study and design of clinical trials, NMA enables inferences to be drawn about the relative effect of drugs that have never been compared. In this way, NMA provides information on the hierarchy of competing interventions for a given disease concerning clinical effectiveness, thus giving clinicians a comprehensive picture for decision-making and potential avoidance of additional costs. However, estimates of treatment effects derived from the results of network meta-analyses should be interpreted with due consideration of their uncertainty, because simple scores or treatment probabilities may be misleading. This is particularly true where, given the complexity of the evidence, there is a serious risk of misinterpretation of information from aggregated data sets. For these reasons, NMA should be performed and interpreted by both expert clinicians and experienced statisticians, while a more comprehensive search of the literature and a more careful evaluation of the body of evidence can maximize the transparency of the NMA and potentially avoid errors in its interpretation. This review provides the key concepts as well as the challenges we face when studying a network meta-analysis of clinical trials.

- Network meta-analysis
- Bayesian network models
- Markov Chain Monte Carlo algorithms
- indirect evidence
- treatment effects
- treatment network
- review

Meta-analysis is a highly recognized scientific discipline capable of providing high-quality evidence in medical research, particularly in clinical oncology (1-3). The main goal of a meta-analysis is to provide a definitive answer to the fundamental clinical research question of which treatment is most effective when it has been evaluated by multiple studies with inconsistent results (4). In particular, the purpose of a meta-analysis of clinical trials, as of any intervention study in general, is to estimate the true effect size of a specific treatment, that is, of the same type of intervention compared with similar control groups, so that it is possible to assess whether that type of treatment is effective. For randomized controlled trials (RCTs), pairwise meta-analysis is a well-known statistical tool for synthesizing evidence from multiple trials, but it addresses only the relative effectiveness of two specific interventions. Its utility in medical practice is therefore limited: because there are usually many competing interventions for a given disease, only a small percentage of the possible pairwise comparisons have been examined in head-to-head studies, and studies for some comparisons may be missing altogether. These needs have led to the development of network meta-analysis (NMA) (5-7), also called mixed treatment comparisons (MTC) (8-11), which may provide more accurate estimates of treatment effects than a pairwise meta-analysis (12). Especially when comparisons between important treatments are missing (13), NMA may be a more useful technique, as it compares multiple treatments simultaneously by combining direct and indirect evidence in a network of RCTs (13-16), providing a more complete picture to clinicians and thus enabling them to ‘rank’ treatments more clearly using summary results.
This is achieved by assessing a composite (mixed) effect size as the weighted average of these direct and indirect components, which then allows competing interventions to be ordered more clearly according to their relative effectiveness, even if they have not been compared in a single trial (17, 18).
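The weighted average of direct and indirect evidence described above can be sketched as follows. This is a minimal illustration that assumes inverse-variance weights and independent direct and indirect estimates, which is a common choice but is not specified in the text; the function name and the numbers are hypothetical.

```python
from math import sqrt

def pool_direct_indirect(direct, se_direct, indirect, se_indirect):
    """Combine a direct and an indirect estimate of the same comparison.

    Assumption: inverse-variance weights and independent estimates.
    Effects must be on an additive scale (e.g. log odds ratio, mean difference).
    """
    w_direct = 1.0 / se_direct**2
    w_indirect = 1.0 / se_indirect**2
    mixed = (w_direct * direct + w_indirect * indirect) / (w_direct + w_indirect)
    se_mixed = sqrt(1.0 / (w_direct + w_indirect))
    return mixed, se_mixed

# Hypothetical inputs: a precise direct estimate and a less precise indirect one.
mixed, se_mixed = pool_direct_indirect(-0.5, 0.2, -0.3, 0.4)
print(f"mixed effect = {mixed:.3f} (SE {se_mixed:.3f})")
```

The mixed estimate lies closer to the more precise component, and its standard error is smaller than that of either component alone.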

In recent years this statistical approach has matured as a technique (19, 20), where models are available for raw data that produce different aggregated outcome measures, using both frequentist and Bayesian models through statistical software packages (16). Especially in the last decade, many applications have been published (21, 22), as there are methodological developments in the subject of NMA. The study of the concept of NMA came to the fore to ‘open wider horizons’ for clinicians, by drawing information from the evaluation of a connected network of studies comparing the results of several interventions simultaneously (23). This approach has gained great popularity among clinicians and decision-makers because the costs involved in the development of new or unnecessary clinical studies may be reduced.

Studying an NMA model during the approval process of a drug can contribute decisively to the design of a clinical trial, because it gives accurate information about both the competitive landscape and the corresponding evidence, which helps to ensure that the trial design is the best possible to receive strong support. Consequently, NMA is a very useful tool for evaluating the comparative effectiveness of different treatments commonly used in clinical practice, provided that appropriate care is taken in interpreting the concepts that characterize it so that the results are not biased or inflated (24). Although this technique is increasingly used by biomedical researchers, it has created several challenges and pitfalls that deserve careful consideration, especially since it inherits all the assumptions of pairwise meta-analysis but with greater complexity due to the multitude of comparisons involved. Moreover, despite the wider acceptance of NMA, there are concerns about the validity of its findings (25). As NMA remains an active research topic, the purpose of this review is to examine the key concepts underlying it, focusing on its risks and benefits and outlining relevant emerging issues and concerns.

## Network Geometry

In clinical trials it is known that for *n* treatments in NMA the maximum number of designs (*i.e.*, each combination of treatments within a study) is $2^{n}-n-1$, while for each multi-arm study, there are $\binom{n}{2}=n(n-1)/2$ comparisons including all possible unique comparisons, even if they are not observed in clinical trials or a pairwise meta-analysis (Figure 1), which would lead to a fully connected network. However, some of the comparisons predicted by the combinatorial formula will be ineligible due to protocol compliance or post hoc limitations (26).
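As a quick check of the two counting formulas, the following sketch evaluates them for a few network sizes. The function name is hypothetical; the formulas are those given in the paragraph above.

```python
from math import comb

def network_counts(n: int) -> tuple[int, int]:
    """Return (maximum number of designs, number of pairwise comparisons)."""
    if n < 2:
        raise ValueError("a network needs at least two treatments")
    max_designs = 2**n - n - 1    # every subset of at least two treatments
    max_comparisons = comb(n, 2)  # n(n-1)/2 unordered pairs
    return max_designs, max_comparisons

for n in (3, 4, 10):
    designs, comparisons = network_counts(n)
    print(f"n={n}: up to {designs} designs, {comparisons} pairwise comparisons")
```

For a network of 10 treatments this gives up to 1013 designs and 45 pairwise comparisons, which shows how quickly the space of possible evidence grows relative to what is usually observed.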

The most important parameter in the utility of a treatment network before relevant data analysis is the assessment of its geometry (27-29), showing which interventions have been directly compared in RCTs and which can only be indirectly compared. In particular, the geometry of the network allows one to understand how many choices there are for each treatment, whether or not certain types of comparisons have been avoided, and whether there are particular patterns among the possible choices of the comparators. However, a network can ‘mutate’ over time as more tests are carried out, thus modifying its geometry which must be studied at each evolving step.

## NMA Assumptions

NMA requires the same steps as a conventional meta-analysis but is graphically represented with a network, thus providing direct information about treatments that can be compared with each other and identifying all interventions linked to a common comparator (the linking treatment). For example, two different treatments have been compared with a placebo in different trials. An NMA allows a hypothesis test to be created that compares these active treatments to each other based on their effectiveness against a common comparator (usually a placebo), thus providing ‘indirect’ evidence. These indirect comparisons provide the opportunity to fill the knowledge gaps of efficacy comparisons of existing treatments, thereby providing a more comprehensive understanding of the multitude of treatment options for the clinician. In short, the network estimate is an aggregate result of the direct and indirect evidence for a given comparison or the indirect evidence if no direct evidence is available. Then, once all the treatments in the existing network have been compared, there are different methods for ranking (30-33) the treatments in terms of their net effectiveness.

The main objective of NMA is to examine and statistically validate the effects of each treatment by evaluating and analyzing three or more interventions/treatments using both direct and indirect evidence. Therefore, basic assumptions such as transitivity, consistency, and homogeneity of direct evidence should be satisfied for performing NMA to be valid. More specifically, these assumptions should be evaluated with statistical tests (34). However, these methodological aspects, although poorly understood, are nevertheless key concepts for understanding a network meta-analysis (35, 36). For this reason, we will explain the basic principles governing these assumptions.

## The Concepts of Transitivity, Consistency, and Heterogeneity in NMA

Transitivity (37) is the property of a set of studies that makes an indirect comparison based on 2 meta-analytic estimates, A *vs*. C and B *vs*. C, meaningful: the studies must be similar in the important clinical characteristics that influence the relevant treatment effects (9) (effect modifiers, *i.e.*, characteristics that influence the relevant outcomes of a clinical intervention). These characteristics need not be identical, and transitivity can therefore be examined by comparing the distribution of potential effect modifiers across the different comparisons (38). Indirect information on the relative effect of 2 interventions will be considered valid if the studies and comparisons in the network do not differ in terms of the distribution of the various effect modifiers (the intervention effects are transitive). A valid indirect comparison (such as AB) requires both AC and BC studies to be similar in terms of the distribution of these characteristics, and only then will the assumption of transitivity apply. The indirect comparison (AB) is then calculated by subtracting BC from AC as defined by the formula (20, 39):

$$\hat{\theta}_{AB}^{\text{indirect}} = \hat{\theta}_{AC}^{\text{direct}} - \hat{\theta}_{BC}^{\text{direct}}$$

where *θ* denotes the observed estimates of treatment effects in terms such as odds ratios (OR), mean difference (MD), *etc*. In oncology, time-to-event data (40) are used, and the hazard ratio (HR) (41) is taken as the measure needed to interpret treatment efficacy. The HR is calculated using Cox regression models (42) in survival analysis and indicates the relative instantaneous risk (hazard) of the event (*e.g.*, death). Although transitivity is essentially an untestable assumption, its validity can be assessed by clinical and epidemiological methods (34), and models have been proposed that, with suitable modifications, help to ensure that it holds (43). However, if the clinical characteristics are different (*e.g.*, different patient populations), then the transitivity assumption is violated, so the estimate of the indirect AB comparison is invalid (44, 45). Furthermore, detecting the absence of transitivity can often be difficult because clinical trial publications do not always provide sufficient detail to allow a thorough assessment (46).
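The subtraction rule for an indirect comparison (AB obtained as AC minus BC) can be illustrated with a short sketch. It assumes the two estimates come from independent sets of trials, so that their variances add, and it works on the log hazard ratio scale; the function name and all numbers are hypothetical.

```python
from math import exp, log, sqrt

def indirect_comparison(theta_ac, se_ac, theta_bc, se_bc, z=1.96):
    """Indirect A vs. B estimate through the common comparator C.

    Assumptions: additive scale (e.g. log HR, log OR, MD), independent
    AC and BC estimates, and transitivity across the two sets of trials.
    """
    theta_ab = theta_ac - theta_bc
    se_ab = sqrt(se_ac**2 + se_bc**2)  # variances add for independent estimates
    ci = (theta_ab - z * se_ab, theta_ab + z * se_ab)
    return theta_ab, se_ab, ci

# Hypothetical pooled hazard ratios versus the common comparator C.
log_hr_ac, se_ac = log(0.70), 0.10
log_hr_bc, se_bc = log(0.85), 0.12

theta_ab, se_ab, (lo, hi) = indirect_comparison(log_hr_ac, se_ac, log_hr_bc, se_bc)
print(f"indirect HR A vs. B = {exp(theta_ab):.3f} "
      f"(95% CI {exp(lo):.3f} to {exp(hi):.3f})")
```

Note that the indirect estimate is always less precise than either of the two direct estimates it is built from, which is one reason why indirect evidence alone should be interpreted with caution.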

Transitivity translated into statistical terms (36) is essentially consistency (or coherence): it holds when the above subtraction equation is supported by the corresponding data. Consistency can only be evaluated when there is a loop in the evidence network, that is, when there is both direct and indirect evidence for a specific comparison of interventions. The basic assumption underlying the validity of indirect and mixed comparisons is that there are no significant differences between trials making different comparisons other than the treatments being compared. Inconsistency (36, 44, 47), which generally occurs when direct and indirect evidence diverge (37), therefore remains an open area and one of the biggest challenges in NMA.

More specifically, the inconsistency may arise from the characteristics of the studies due to their different design or when the estimates of the size of the direct and indirect effects differ (48).

The magnitude of inconsistency in an NMA can be estimated statistically by comparing direct and indirect summary effects in predetermined loops (15, 49), or across a network by fitting models that allow and disallow inconsistency (50, 51). There are several methods for measuring inconsistency when it is suspected (48), such as the Akaike (52) and Deviance (53) information criteria for assessing the goodness of fit of models in frequentist/Bayesian approaches to NMA, or meta-regression models (50). Methods for detecting inconsistency in an RCT network also include the inconsistency parameter approach (48) and the net heat graphical approach (54, 55). ‘Node splitting’ methods (56-58) have also been reported in the literature to assess inconsistency in NMA: a direct comparison is separated from the rest of the network and the difference between the direct estimate and the indirect estimate from the remaining network is calculated, while appropriate decision rules have been defined to select only those comparisons belonging to potentially inconsistent loops in the network (57). As mentioned earlier, inconsistency exists when there are discrepancies between direct and indirect estimates; a lack of transitivity is therefore a common cause of inconsistency.
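The simplest of these ideas, comparing the direct and indirect estimates within a single loop, can be sketched as below. This is only an illustration of the loop-based comparison, not of node splitting or design-by-treatment models; it assumes independent direct and indirect estimates and a normal approximation, and the function names and numbers are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def loop_inconsistency(direct, se_direct, indirect, se_indirect):
    """Difference between direct and indirect evidence with a two-sided z-test.

    Assumption: the two estimates are independent, so their variances add.
    """
    difference = direct - indirect
    se_difference = sqrt(se_direct**2 + se_indirect**2)
    z = difference / se_difference
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return difference, se_difference, z, p_value

# Hypothetical direct and indirect estimates of the same comparison.
diff, se_diff, z, p = loop_inconsistency(-0.5, 0.2, -0.3, 0.4)
print(f"inconsistency = {diff:.2f} (SE {se_diff:.3f}), z = {z:.2f}, p = {p:.2f}")
```

A non-significant result here does not demonstrate consistency: with imprecise indirect evidence the test has low power, so the width of the interval around the difference matters as much as the p-value.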

Another very important advantage of NMA is its ability to investigate whether the results from different clinical trials are homogeneous or heterogeneous within each of the pairwise comparisons it involves; this assessment is important and should always be considered. There are many valuable reviews on assessing and dealing with heterogeneity (59, 60) in a network. Heterogeneity in a meta-analysis is usually assessed with Cochran’s Q statistic (61-64). In particular, Cochran’s generalized Q statistic for multivariate meta-analyses can be used in the context of NMA to quantify heterogeneity across the network, both within designs and between designs (the latter is known as inconsistency). Although the heterogeneity variance is often the most difficult parameter to estimate, several alternative approaches to estimating it have been explored in NMA studies (65) in recent years, such as the $I^{2}$ statistic (62, 66-68), which describes the proportion of total variation that is due to between-study variation, while meta-regression models (69, 70) are mainly used to reduce heterogeneity (and inconsistency) between RCTs in the network. Measures have also been proposed to assess confidence in the results of an NMA by analyzing the impact of their variability on the corresponding clinical decisions (71). In the special case when between-study heterogeneity variances are estimated with considerable imprecision (because the data are sparse), including external evidence usually improves the conclusions (72). However, as the power and precision of indirect comparisons included in an NMA depend on sample size and extensive statistical information, further improvements in methodology should be made.
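For a single pairwise comparison, Cochran's Q and the derived $I^{2}$ statistic can be computed as in the sketch below. It covers only the ordinary (univariate) Q with inverse-variance weights, not the generalized Q used across a whole network; the function name and the three study results are hypothetical.

```python
def cochran_q_i2(effects, std_errors):
    """Return (fixed-effect pooled estimate, Cochran's Q, I^2) for one comparison."""
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")
    weights = [1.0 / se**2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2 is the share of Q in excess of its degrees of freedom, floored at 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2

# Three hypothetical studies of the same comparison with equal precision.
pooled, q, i2 = cochran_q_i2([0.1, 0.3, 0.5], [0.1, 0.1, 0.1])
print(f"pooled = {pooled:.2f}, Q = {q:.2f}, I^2 = {100 * i2:.0f}%")
```

With only a handful of studies, both Q and $I^{2}$ are estimated imprecisely, which is the situation in which the text above suggests bringing in external evidence.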

## Ranking Treatments in NMA

The results of the studies are always subject to uncertainty, and consequently we cannot be sure that a given treatment is the most effective. We can, however, determine the probability that each treatment is the best. In a Bayesian framework, the probability that each treatment has a particular rank is derived from the posterior distributions of all treatments. The treatments are then ranked by the surface under the cumulative ranking curve (SUCRA) (30), which summarizes the overall rank in a single number associated with each treatment. The higher the SUCRA value and the closer it is to 100%, the higher the probability that a treatment is at or near first place, while the opposite is true when this value is close to 0. To compare treatments in an NMA from the frequentist perspective, an analog of SUCRA called the *P*-score is also used. Both measures allow the ranking of treatments on a continuous scale of 0-1 (73), while rankograms represent the rank probabilities graphically (74).
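The calculation of SUCRA from rank probabilities can be sketched as follows. The sketch assumes that the rank probabilities have already been obtained (for example from posterior draws) and that rank 1 is the best rank; the function name and the probability table are hypothetical.

```python
def sucra(rank_probs: dict[str, list[float]]) -> dict[str, float]:
    """SUCRA per treatment from rank probabilities (index 0 = best rank).

    SUCRA is the average of the cumulative rank probabilities over the
    first a-1 ranks, where a is the number of treatments.
    """
    n_treatments = len(rank_probs)
    if n_treatments < 2:
        raise ValueError("need at least two treatments")
    scores = {}
    for name, probs in rank_probs.items():
        if len(probs) != n_treatments or abs(sum(probs) - 1.0) > 1e-8:
            raise ValueError(f"invalid rank probabilities for {name}")
        cumulative, area = 0.0, 0.0
        for p in probs[:-1]:  # the last cumulative value is always 1
            cumulative += p
            area += cumulative
        scores[name] = area / (n_treatments - 1)
    return scores

# Hypothetical rank probabilities for three treatments.
example_probs = {
    "A": [0.7, 0.2, 0.1],
    "B": [0.2, 0.5, 0.3],
    "C": [0.1, 0.3, 0.6],
}
scores = sucra(example_probs)
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: SUCRA = {scores[name]:.2f}")
```

Because SUCRA values always average to 0.5 across the treatments in a network, a high value says only that a treatment ranks well relative to the others, not that its effect is clinically important.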

## Bayesian Statistical Inference

In addition to frequentist inference, which is arguably more commonly used in most research fields, Bayesian statistics (75, 76) is another very important statistical inference tool. Its main advantage is a framework that properly accounts for uncertainty in the heterogeneity variance (77), and it is also more flexible, as it can handle problems that frequentist techniques cannot, such as missing data. In addition, it is considered more robust because it gives more precise effect estimates with smaller credible intervals; it should therefore not be regarded as a competitor to frequentist statistical analysis, but as an additional tool that contributes to a more meaningful result.

Bayesian statistics treats the unknown quantities as random variables and assigns a prior probability distribution to each of them; by also specifying a joint probability distribution for the data (*i.e.*, a likelihood), we obtain a full probability model for the set of observable and unobservable quantities. In short, in Bayesian inference, prior beliefs (represented by prior distributions) are combined with existing data to arrive at a posterior distribution (Figure 2). Let us assume that the observed data are represented by *y* and the unknown parameters by *θ*. Then, to make inference about *θ*, we use Bayes’ theorem (78, 79) to obtain the posterior distribution, *i.e.*, the distribution of the model parameters conditional on the observed data, which can also be used for making predictions about future events: $p(\theta \mid y) \propto p(\theta)\,p(y \mid \theta)$. Here $p(y \mid \theta)$ is the conditional probability of the data given the model parameters, known as the likelihood function, while $p(\theta)$ is the probability of certain model parameter values in the population, which is the prior distribution. Therefore, the posterior distribution $p(\theta \mid y)$ is proportional to the likelihood function times the prior distribution (80, 81).
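A case in which the posterior can be written down exactly helps to make the proportionality concrete. The sketch below assumes a normal prior for a single treatment effect and a normal likelihood with known standard error (the conjugate normal-normal model); this simplification, the function name, and the numbers are illustrative assumptions rather than part of the review.

```python
from math import sqrt

def normal_posterior(prior_mean, prior_sd, y, se):
    """Posterior mean and SD for theta when theta ~ N(prior_mean, prior_sd^2)
    and y | theta ~ N(theta, se^2) with se treated as known."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / se**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + data_precision * y) / post_precision
    return post_mean, sqrt(1.0 / post_precision)

# Hypothetical inputs: a vague prior centred on no effect, one observed effect.
post_mean, post_sd = normal_posterior(prior_mean=0.0, prior_sd=1.0, y=-0.5, se=0.2)
print(f"posterior mean = {post_mean:.3f}, posterior SD = {post_sd:.3f}")
```

The posterior mean is a precision-weighted compromise between the prior mean and the data, and it moves toward the data as the prior becomes more diffuse, which is why sensitivity analyses over the prior are informative.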

## Methodology

Bayesian meta-analysis is mainly based on the hierarchical Bayes model, whose basic principles are very similar to those of the ordinary random-effects model. When fitting Bayesian meta-analysis models, it is critical to check that the model was run for enough iterations to converge, as well as to perform sensitivity analyses with different prior specifications to assess their effect on the overall simulation results. The Markov Chain Monte Carlo (MCMC) algorithm (82) used in Bayesian probabilistic models must have converged to the target posterior distribution; otherwise, more iterations are needed. MCMC simulation plays a very important role here because it allows the estimation of the posterior distributions of the parameters for the results of the NMA (83).
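The convergence check described above can be illustrated with a toy sampler. The sketch below uses a random-walk Metropolis algorithm, chosen here only for simplicity and not because the review or any particular NMA package prescribes it, on a one-parameter normal model, and monitors convergence with the Gelman-Rubin statistic computed from several chains started at dispersed values. All names, data, and tuning constants are hypothetical.

```python
import math
import random

# Illustrative (assumed) data and prior: one observed effect with its standard
# error, and a normal prior on the true effect theta.
Y_OBS, SE_OBS = -0.5, 0.2
PRIOR_MEAN, PRIOR_SD = 0.0, 1.0

def log_posterior(theta: float) -> float:
    """Unnormalised log posterior: log prior plus log likelihood."""
    log_prior = -0.5 * ((theta - PRIOR_MEAN) / PRIOR_SD) ** 2
    log_lik = -0.5 * ((Y_OBS - theta) / SE_OBS) ** 2
    return log_prior + log_lik

def metropolis(start, n_iter, step, rng):
    """Random-walk Metropolis chain of length n_iter starting at `start`."""
    chain, current = [], start
    current_lp = log_posterior(current)
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal)
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        chain.append(current)
    return chain

def gelman_rubin(chains):
    """Potential scale reduction factor; values near 1 suggest convergence."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    between = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    within = sum(
        sum((x - mu) ** 2 for x in c) / (n - 1) for c, mu in zip(chains, means)
    ) / m
    pooled_var = (n - 1) / n * within + between / n
    return math.sqrt(pooled_var / within)

def run(seed=2023, n_iter=5000, burn_in=1000, step=0.4):
    """Run four chains from dispersed starts and discard the burn-in."""
    rng = random.Random(seed)
    starts = [-3.0, -1.0, 1.0, 3.0]
    chains = [metropolis(s, n_iter, step, rng)[burn_in:] for s in starts]
    draws = [x for c in chains for x in c]
    return sum(draws) / len(draws), gelman_rubin(chains)

posterior_mean, r_hat = run()
print(f"posterior mean ~ {posterior_mean:.3f}, R-hat = {r_hat:.3f}")
```

If the statistic stays well above 1, the chains have not yet mixed and more iterations (or a better-tuned sampler) are needed before the posterior summaries are used.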

## Software Options for Fitting NMA Models and Assessing Inconsistency

The most popular R (84) software packages currently available for Bayesian and frequentist inference in NMA are listed in Table I. Details on how data are analyzed, the input options, and the corresponding statistical models can be found in each package’s manual, which is also cited in the references. However, because most of these packages require substantial familiarity with statistical programming (writing routines to perform the NMA), there are also toolkits based on simple, standard instructions that present the results using only the graphs of the analyses (85).

## An Example of a Network Meta-analysis in Diabetes

The NMA considered here as an example comes from diabetes research. Its objective was to estimate the relative effects on *HbA1c* (glycated hemoglobin) change of adding different treatments to a baseline sulfonylurea therapy in patients with type 2 diabetes; the mean *HbA1c* change from baseline was used, and *HbA1c* was measured after a follow-up of 3-12 months (113). The studies contained in this data set compared different treatments for blood glucose control in patients with diabetes. The researchers selected 26 studies comprising a total of 6,646 patients and 10 drug groups: acarbose (*acar*), benfluorex (*benf*), metformin (*metf*), miglitol (*migl*), pioglitazone (*piog*), placebo (*plac*), rosiglitazone (*rosi*), sitagliptin (*sita*), sulfonylurea alone (*sulf*), and vildagliptin (*vild*). In the corresponding network, there were 15 different designs (*i.e.*, the set of treatments compared in one study). The data recorded the treatment effect (TE), expressed here as MD, the standard error of the effect (seTE), the names of the treatments, and the name of each study. The effect measure was the MD in blood plasma glucose concentration, and the fixed-effects model was used. The network was visualized (Figure 3) with the netmeta R statistical package (108, 114). Based on the ranking of treatments (rankogram) in the network with the *P*-score (73), the top 3 interventions are rosiglitazone, which appears to be the most effective (*P*-score=0.978), metformin (*P*-score=0.851), and pioglitazone (*P*-score=0.768), while the corresponding SUCRA values deviate very little (rosiglitazone=0.983, metformin=0.852, and pioglitazone=0.766) (108). However, clinicians and decision-makers should not consider an intervention to be best just because it comes first unless the quality of the evidence used and the confidence in the NMA results are considered (30).

## Discussion

In general, NMA can provide increased statistical power when normal network connectivity is possible and sample sizes are sufficient. Mathematical approaches exist to ‘measure’ network connectivity, but raw data are required to calculate these indicators (39, 115). However, inappropriate use of NMA can lead to erroneous results, such as when there is low network connectivity and therefore low statistical power (1, 44, 116) or when the results are derived from indirect data which, although they remain observations, are nevertheless not interpreted with due care (7, 14). Regarding indirect treatment comparisons, there is disagreement among researchers about the validity of their use in decision-making and especially when direct treatment comparisons are also available (117-119). More specifically, it is argued that decisions should not be made based on rank probabilities alone (especially when treatments are not directly compared) as they may be incompletely informed (120), but also because estimates of rank probabilities are extremely sensitive as they are influenced by factors such as an unequal number of studies per comparison in the network, sample size of individual studies, overall network configuration, and effect sizes between treatments. For example, an unequal number of studies per comparison may lead to biased (121) estimates of treatment rank probabilities for each network considered and thus to an incorrect NMA analysis, as a result of increased variability in the precision of treatment effect estimates (122). For these reasons, it is necessary to provide detailed reports on the strategy researchers intend to follow to assess transitivity and consistency and clarify their calculation methods. 
Clinicians should also always be cautious about effect sizes and treatment rankings because a good ranking does not necessarily mean a clinically important effect size, and on the other hand, treatment rankings derived from NMAs can often show some degree of inaccuracy (123). This is because their uncertainty can be ignored and so the rankings give the illusion that some interventions are better than others when the relative effects are not different from zero beyond chance (28). However, even so, NMA has serious advantages over pairwise meta-analyses. Especially when there are cycles of evidence (loops), the Bayesian NMA approach has been shown to significantly improve effect estimates compared to separate pairwise meta-analyses (124).

Another equally important aspect that should be considered before constructing an NMA, and that could help researchers to further improve the results, is, as mentioned above, exploring the geometry of the network and, by extension, the number of nodes (treatments) to be included, because a decision maker may not be interested in all pairwise comparisons of the network. Because the relative effect between two active treatments can often be more influential in decision-making than the relative effects of all active treatments versus placebo, researchers could modify the network by using only a subset of available treatments, namely those that are considered clinically more relevant (36) (the most effective treatments). Conversely, including studies that compare treatments without clinical interest provides additional indirect evidence on the comparisons of clinical interest, which may increase the precision of the estimates (9, 125) but may also lead to additional inconsistencies (126). In network studies, it is common to exclude trials and specific comparators based on a variety of criteria, because including all possible interventions ever evaluated in an RCT gives unclear and discouraging conclusions. Moreover, some trials deviate significantly from others and it is not advisable to combine them in an NMA (trial-level outliers); Bayesian outlier detection measures have been proposed for such cases (127). However, removing treatments from an NMA can significantly change the effect estimates and thus the probability ranking of the most appropriate treatment. Well-connected treatments appear to have the most influence (128). Consequently, the greatest impact on the results occurs when well-connected nodes are removed, and so the most frequently evaluated treatments available for a condition should be included for a network to be valid.
Special care is required when it comes to exclusions of potential nodes, and decisions on eligibility criteria must be carefully justified, because small “mutations” in the geometry of the NMA have a direct impact on the analysis and in turn affect the decision-making process. That is why the ‘node-making’ process has been identified as one of the most important problems in NMA, where different ways of generating treatment nodes could significantly affect the results (129-131). But in addition to network size, it is proposed to incorporate the description of specific graph theory statistical measures to complement graph information (132, 133). Particularly for distinguishing similar NMAs, sensitivity analysis is critical to perform when ‘confounding’ is identified in the initial review to infer the absence of heterogeneity, especially when the studies are few (134).

Another strong reason why the definition of the nodes is critical is that interventions often combine more than one treatment. Researchers commonly tend to combine treatment arms, so that treatments with different characteristics, or patients with different subtypes that cannot belong to the same group, are merged into one treatment at a node. The goal is to increase the statistical power of the network or to connect the network (1), but doing so introduces bias into the network (135, 136). The simplest approach would be to analyze each combination as a different node in the NMA. Furthermore, evidence has shown that meta-analysis across multiple smaller RCTs is more valuable than one large RCT (137). As there are always confounding factors in studies that can affect the results, the variation in treatment effect between trials gives a better estimate of the mean effect than a single RCT. A simulation study showed that when treatment effects are truly additive, the ‘conventional’ NMA model does not outperform them (138).

A notable recent development in NMA that is steadily gaining ground is the incorporation of non-randomized data to assess relative treatment effects, especially when randomized data are limited, to avoid disconnected networks. Incorporating real-world evidence from non-randomized studies can confirm findings from RCTs, thereby increasing the accuracy of results and empowering the decision-making process (139). Because a quality meta-analysis is highly dependent on the availability of individual study data, the use of individual patient data (IPD) in NMA is increasingly recognized in the scientific community. More specifically, the benefits of integrating various proportions of IPD studies into one NMA, and of combining aggregate data (AD) and IPD in the same NMA, have been explored. This is because standard NMA methods combine aggregate data from multiple studies, assuming that effect modifiers are balanced across populations (95). New methods such as population adjustment methods relax this assumption. One such approach is to analyze IPD from each study in a meta-regression model. IPD-based NMA can lead to increased precision of estimated treatment effects. Additionally, it can help improve network coherence and account for heterogeneity across studies by adjusting for participant-level effect modifiers and adopting more advanced models to deal with missing response data. Although such data are not always available, an increased proportion of IPD has been shown to lead to more accurate estimates for most models (140, 141), and these methods need further evaluation. A typical example, and the most recent, is the multilevel network meta-regression (ML-NMR) method, a generalization of NMA for synthesizing data from a mixture of IPD and AD studies that provides estimates for a decision target population (95, 96, 103, 142).
This use of meta-analysis, which is also the future of population adjustment, including individual studies, can be extended to areas such as prognostic models and prognostic factors that are particularly important in medical disciplines such as oncology.

## Conclusion

As NMA becomes more popular and therefore more influential in the scientific community, familiarity with statistical network concepts will become a necessity as the demands for transparency and more reliable synthesis of the original data increase. Enriching the databases used for meta-analysis, combined with the opinion of experienced researchers, can improve the construction of more reliable predictive models for the desired outcome. This should be done, however, on the understanding that the construction and study of an NMA should always be based on detailed protocols, as this is the only way to protect against decisions such as the selective use of circumstantial evidence. In any case, NMA as a statistical tool is undoubtedly very useful for evaluating the comparative results of multiple competing interventions in clinical practice and is the ‘next step’ in meta-analysis for further health technology development. However, more specialized training is needed to ensure that the basic methodologies underlying NMAs are understood by health researchers, to maximize their ability to interpret and validate these results.

## Footnotes

**Authors’ Contributions** Conceptualization: GB; IP. Literature search: GB. Network analysis: GB. Writing original draft: GB. Critically revised the work: GB; IP. Supervised the study: IP.

**Conflicts of Interest** The Authors declared no potential conflicts of interest in relation to this study.

- Received March 13, 2023.
- Revision received March 27, 2023.
- Accepted March 31, 2023.

- Copyright © 2023, International Institute of Anticancer Research (Dr. George J. Delinasios), All rights reserved

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY-NC-ND) 4.0 international license (https://creativecommons.org/licenses/by-nc-nd/4.0).