Simply put, a verbal multi-word expression is a multi-word expression (MWE) that contains a verb. Verbal MWEs can be idiomatic expressions such as "kick the bucket" or "spill the beans", or light-verb constructions, where a non-verb word describes the action, as in "have a conversation" (the action is described by the noun "conversation"). Phrasal verbs like "to look up" (a word in a dictionary) or "to put up" (with something/someone) are also verbal MWEs. In some languages, like French or Spanish, verbs can also be used in reflexive form, as in "se trouver" (to be located), "se dérouler" (to unfold) or "se battre" (to fight, to strive), which also produces verbal MWEs.
We approached the task of identifying verbal MWEs in text as a named-entity recognition problem. More specifically, we applied conditional random fields (CRF), a state-of-the-art sequence labelling algorithm that is very successful at recognising named entities.
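To make the sequence labelling framing concrete, here is a minimal sketch of how a sentence with a verbal MWE could be turned into per-token labels for a CRF to learn. The B/I/O scheme, the `VMWE` label name and the `bio_encode` helper are assumptions made for illustration; the exact encoding we used is described in the paper.

```python
def bio_encode(tokens, mwe_token_indices):
    """Return one label per token: B-VMWE, I-VMWE or O.

    mwe_token_indices holds one list of token positions per MWE.
    Positions need not be adjacent, because verbal MWEs such as
    "look ... up" can be discontinuous.
    """
    labels = ["O"] * len(tokens)
    for indices in mwe_token_indices:
        for position, token_index in enumerate(sorted(indices)):
            # First token of the expression opens it, the rest continue it.
            labels[token_index] = "B-VMWE" if position == 0 else "I-VMWE"
    return labels


tokens = ["She", "looked", "the", "word", "up", "yesterday"]
labels = bio_encode(tokens, [[1, 4]])  # "looked ... up"
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A CRF trained on pairs like these predicts a whole label sequence per sentence, which is the same setup typically used for named entities.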
We also relied on a second intuition: many verbal MWEs, such as idiomatic expressions, don't have a literal meaning, so the meaning of the full verbal MWE will be somewhat unrelated to the meaning of each of its individual components. We exploited this by computing distributional semantic similarity scores between a vector representing the full verbal MWE and the vectors of its individual words. Our assumption is that the lower these similarity scores are, the less literal the MWE is. We combined these similarity/literalness scores into a single score using linear regression, and used that score to re-rank the top 10 label sequences produced by the CRF, achieving 5-10% gains in F1 scores.
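The re-ranking idea can be sketched roughly as follows. This is an illustration under stated assumptions, not our actual implementation: the toy two-dimensional vectors, the choice of mean cosine similarity as the only similarity feature, the candidate dictionary layout and the weights (which in practice would be fitted by linear regression) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_similarity(mwe_vector, word_vectors):
    """Average similarity between the full MWE and each component word."""
    sims = [cosine(mwe_vector, w) for w in word_vectors]
    return sum(sims) / len(sims)


def combined_score(candidate, w_crf, w_sim, intercept):
    """Linear combination of the CRF probability and the similarity feature."""
    sim = mean_similarity(candidate["mwe_vector"], candidate["word_vectors"])
    return intercept + w_crf * candidate["crf_prob"] + w_sim * sim


def rerank(candidates, w_crf=1.0, w_sim=-0.5, intercept=0.0):
    """Sort candidate label sequences by combined score, best first.

    The default weights are made up; a negative w_sim rewards candidates
    whose proposed MWE is dissimilar to its parts (less literal).
    """
    return sorted(
        candidates,
        key=lambda c: combined_score(c, w_crf, w_sim, intercept),
        reverse=True,
    )


# Two hypothetical CRF candidates for the same sentence.
candidates = [
    {"name": "literal", "crf_prob": 0.45,
     "mwe_vector": [1.0, 0.0], "word_vectors": [[1.0, 0.1], [0.9, 0.0]]},
    {"name": "idiomatic", "crf_prob": 0.40,
     "mwe_vector": [1.0, 0.0], "word_vectors": [[0.0, 1.0], [0.1, 1.0]]},
]
for candidate in rerank(candidates):
    print(candidate["name"])
```

With these made-up numbers, the candidate whose expression is far from its component words overtakes the one the CRF originally preferred, which is the effect the re-ranking step is meant to have.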
You can get all of the details of our approach in our research paper.
You can also see a video recording by Aaron Li-Feng Han of a talk I gave at the Dublin Computational Linguistics Research Seminar on this topic.