To address the sparse supervision signals and limited scene-inference capability of current autonomous driving VLA models trained under the traditional imitation learning paradigm, we develop WorldDriveVLA, an end-to-end autonomous driving framework that integrates a world model with a VLA model. The framework adopts a natively unified multimodal feature representation within an autoregressive architecture, aligning visual, linguistic, and action information in a shared semantic space. Through multi-task self-supervised training on large-scale driving data, the model jointly learns to model the dynamics of the external world and to plan its own motion. Building on the general understanding and reasoning capabilities of VLA models, the framework further performs explicit inference over future driving scenarios, improving the robustness and interpretability of autonomous driving decision-making and planning.
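To make the joint objective concrete, the sketch below illustrates one plausible instantiation of the described setup: a single autoregressive transformer over a unified sequence of discretized vision, language, and action tokens, with two self-supervised heads, one predicting future scene tokens (world modeling) and one predicting ego action tokens (motion planning). This is a minimal illustration under assumed design choices, not the authors' implementation; all module names, vocabulary sizes, and dimensions are hypothetical.

```python
# Minimal sketch (not the released WorldDriveVLA code): one autoregressive
# backbone over unified multimodal tokens, trained with a combined
# world-modeling + motion-planning loss. All sizes are illustrative.
import torch
import torch.nn as nn


class UnifiedDrivingAR(nn.Module):
    def __init__(self, vocab_size=16384, action_vocab=256, d_model=512,
                 n_layers=6, n_heads=8, max_len=1024):
        super().__init__()
        # Shared embedding for discretized vision / language / action tokens.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, 4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Two task heads over the shared representation.
        self.scene_head = nn.Linear(d_model, vocab_size)     # future scene tokens
        self.action_head = nn.Linear(d_model, action_vocab)  # discretized actions

    def forward(self, tokens):
        # tokens: (B, T) ids from the unified multimodal sequence.
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)[None]
        # Causal mask enforces left-to-right autoregressive prediction.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.backbone(x, mask=mask)
        return self.scene_head(h), self.action_head(h)


def multitask_loss(model, tokens, next_scene_tokens, next_action_tokens,
                   scene_weight=1.0, action_weight=1.0):
    """Joint self-supervised objective: world modeling + motion planning."""
    scene_logits, action_logits = model(tokens)
    ce = nn.functional.cross_entropy
    l_scene = ce(scene_logits.flatten(0, 1), next_scene_tokens.flatten())
    l_action = ce(action_logits.flatten(0, 1), next_action_tokens.flatten())
    return scene_weight * l_scene + action_weight * l_action


if __name__ == "__main__":
    model = UnifiedDrivingAR()
    tokens = torch.randint(0, 16384, (2, 128))          # unified input sequence
    scene_targets = torch.randint(0, 16384, (2, 128))   # shifted scene tokens
    action_targets = torch.randint(0, 256, (2, 128))    # shifted action tokens
    loss = multitask_loss(model, tokens, scene_targets, action_targets)
    loss.backward()
    print(float(loss))
```

Because both heads share the same autoregressive backbone, gradients from the scene-prediction loss shape the representation used for planning, which is one way the world-modeling signal can densify supervision beyond imitation of expert trajectories alone.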