In the general context of social consensus problems, we consider an extension of the voter model in which a set of interacting elements (agents) can be in either of two equivalent states (A or B) or in a third, mixed state (AB). The model is motivated by studies of language competition dynamics, where the AB state is associated with bilingualism. We search for conditions under which, in a finite complex network of interactions, the ordering dynamics towards either of the two absorbing states has no characteristic time scale. To this end, we study networks with mesoscale community structure built up from randomly connected cliques. We find that large heterogeneity at the mesoscale level of the network appears to be a sufficient mechanism for the absence of a characteristic time for the dynamics. Such heterogeneity results in dynamical metastable states that survive at any time scale.