Sign Language Translation (SLT) aims to generate spoken language translations from sign language videos. However, currently available sign language datasets are too small for models to learn the linguistic properties of spoken language. In this paper, towards effective SLT, we propose a novel framework that takes advantage of spoken language grammar learned from a large corpus of text sentences. Our framework consists of three key modules: word existence verification, conditional sentence generation, and cross-modal re-ranking. We first check the existence of each word in the vocabulary through a series of binary classifications performed in parallel. The detected words are then assembled under the guidance of a pretrained spoken language generator, which produces multiple candidate sentences in the manner of spoken language. Finally, a cross-modal re-ranking model selects the sentence most semantically similar to the input sign video as the translation result. We evaluate our framework on two large-scale continuous SLT benchmarks, i.e., CSL and RWTH-PHOENIX-Weather 2014T. Experimental results demonstrate that the proposed framework achieves promising performance on both datasets.
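To make the three-stage pipeline concrete, the following is a minimal sketch in PyTorch. All module names, dimensions, thresholds, and interfaces here are illustrative assumptions rather than the paper's actual architectures; in particular, `generate_candidates` stands in for the pretrained spoken language generator, whose details are not specified in this abstract.

```python
# Sketch of the three-module SLT framework: (1) parallel word existence
# verification, (2) candidate sentence generation, (3) cross-modal re-ranking.
# All names and sizes are hypothetical, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, VOCAB_SIZE, EMB_DIM = 512, 1000, 256  # assumed toy dimensions

class WordExistenceVerifier(nn.Module):
    """Stage 1: one binary classifier per vocabulary word, run in parallel."""
    def __init__(self):
        super().__init__()
        # A single linear layer realizes VOCAB_SIZE independent binary
        # classifiers over a shared pooled video feature.
        self.classifiers = nn.Linear(FEAT_DIM, VOCAB_SIZE)

    def forward(self, video_feat, threshold=0.5):
        probs = torch.sigmoid(self.classifiers(video_feat))
        return (probs.squeeze(0) > threshold).nonzero().flatten()  # detected word ids

def generate_candidates(word_ids, num_candidates=5):
    """Stage 2 placeholder: a pretrained spoken language generator would
    assemble the detected words into fluent candidate sentences. Here we
    simply return dummy sentence embeddings, one per candidate."""
    return torch.randn(num_candidates, EMB_DIM)

class CrossModalReranker(nn.Module):
    """Stage 3: score semantic similarity between video and each candidate."""
    def __init__(self):
        super().__init__()
        self.video_proj = nn.Linear(FEAT_DIM, EMB_DIM)
        self.text_proj = nn.Linear(EMB_DIM, EMB_DIM)

    def forward(self, video_feat, sentence_embs):
        v = F.normalize(self.video_proj(video_feat), dim=-1)   # (1, EMB_DIM)
        t = F.normalize(self.text_proj(sentence_embs), dim=-1)  # (N, EMB_DIM)
        return t @ v.squeeze(0)  # cosine similarity per candidate

# End-to-end flow on a single (random) pooled video feature.
video_feat = torch.randn(1, FEAT_DIM)
word_ids = WordExistenceVerifier()(video_feat)            # stage 1
candidates = generate_candidates(word_ids)                # stage 2
scores = CrossModalReranker()(video_feat, candidates)     # stage 3
best = scores.argmax().item()  # translation = highest-scoring candidate
```

The design intent, as described above, is that stage 2 injects spoken language grammar learned from large text corpora, while stage 3 grounds the final choice back in the visual input rather than trusting the generator alone.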