NLP methods for improving user rating systems in crowdsourcing forums and speech recognition of less resourced languages

Abstract: We develop NLP and ASR methods (e.g., algorithms, architectures) that address the following problems: biases induced by user rating, ranking, recommendation and search engine algorithms; computational inefficiencies of conventional syntactic-semantic parsing algorithms; the extensive linguistic resource requirements imposed by traditional ASR methods; interoperability issues faced by NLP and ASR components within cross-media analysis; and the limited searchability of audio-video content on the Web. User rating systems (URSs) in crowdsourcing forums (CSFs) (e.g., QA) rely solely on voting schemes and fail to incorporate linguistic quality and user competence information. This failure affects the trustworthiness of answers on the Web, as search engines are likely to be biased towards popular (high-voted) answers. It also propagates to the quality of entire QA platforms, as other components depend on the accuracy of the underlying URSs. Conventional ASR methods, on the other hand, present two major challenges: acoustic models fail to work within collaborative environments, as these methods only help build models that operate in isolation, and the methods demand extensive linguistic resources. The contributions of this thesis have been published in 9 papers at prestigious AI, NLP and ASR venues, which have received over 90 citations.
The proposed approaches have the potential to transform voting-based rating into linguistic-quality-based rating; answer quality prediction based on shallow linguistic (meta-data) features into prediction based on deep syntactic-semantic and user competence features; sequential single-machine syntactic parsing into parallel, distributed cloud-based parsing; and meta-data-based querying of spoken documents into full-text querying and searching, together with sentiment- and competence-based querying of textual content. Theoretically, we advance the understanding of the relationship between the text authors write and their proficiency in performing certain tasks, through successive research works that aim to discover the rules governing the conjectured link between the two. We also propose new bag-of-words approaches (based on latent topic modeling, syntactic categories and dependency relations). These approaches yield significant accuracy gains over conventional TF-IDF (term frequency–inverse document frequency) based models, and reduce domain dependencies, as they potentially capture structural and topical information.
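
To make the contrast between a conventional TF-IDF representation and a bag of syntactic categories concrete, the following is a minimal sketch only, not the models developed in the thesis. The tokenized toy documents, the `TOY_TAGS` lexicon and the function names `tf_idf` and `category_bag` are all hypothetical and introduced purely for illustration; a real system would obtain categories from a part-of-speech tagger or parser.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses relative term frequency and an unsmoothed log inverse
    document frequency; other TF-IDF variants exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy lexicon mapping words to coarse syntactic categories
# (an assumption for this sketch; a tagger would supply these in practice).
TOY_TAGS = {
    "the": "DET", "parser": "NOUN", "model": "NOUN",
    "runs": "VERB", "fails": "VERB", "fast": "ADV",
}


def category_bag(doc):
    """Count syntactic categories instead of surface words."""
    return Counter(TOY_TAGS.get(token, "X") for token in doc)


docs = [["the", "parser", "runs", "fast"], ["the", "model", "fails"]]
word_weights = tf_idf(docs)
cat_bags = [category_bag(doc) for doc in docs]
```

In this toy setting the two documents share only one word ("the"), which TF-IDF weights at zero, yet their category bags overlap in the DET, NOUN and VERB counts. This is one intuition for why category-based features might depend less on domain vocabulary than word-based ones.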