26 November 2014

DATA MINING: BIBLIOGRAPHIES



Images: Manouam and Uaeh.
What is it?


 

Download the monograph

Data mining, or data exploration (the analysis stage of “Knowledge Discovery in Databases”, or KDD), is a field of computer science concerned with the process of discovering patterns in large volumes of data. It draws on methods from artificial intelligence, machine learning, statistics and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Beyond the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, computational complexity considerations, post-processing of the discovered structures, visualization and online updating.
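
To make that pipeline concrete, the following minimal Python sketch (using pandas and scikit-learn) runs through the typical stages of selection, pre-processing, modelling and post-processing; the toy table, the column names and the choice of k-means are illustrative assumptions rather than a prescription.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical extract from a transactional database (the "selection" stage).
df = pd.DataFrame({
    "monthly_spend": [120, 135, 80, 300, 310, 290, 40, 45],
    "visits":        [10, 12, 8, 3, 2, 4, 20, 22],
})

X = StandardScaler().fit_transform(df)                    # pre-processing
model = KMeans(n_clusters=3, n_init=10, random_state=0)   # modelling: pattern discovery
df["segment"] = model.fit_predict(X)

# Post-processing: turn the discovered patterns into an understandable structure.
print(df.groupby("segment").mean())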

The term is a buzzword and is frequently misused to refer to any form of large-scale data or information processing (collection, extraction, warehousing, analysis and statistics); it has also been generalized to any kind of computer decision support system, including artificial intelligence, machine learning and business intelligence. In this usage, the key word is discovery, commonly defined as “detecting something new”. Even the popular book “Data Mining: Practical Machine Learning Tools and Techniques with Java” (which covers mostly machine learning material) was originally going to be called simply “Practical Machine Learning”, and the term “data mining” was added for marketing reasons. Often the more general terms “(large-scale) data analysis” or “analytics”, or, when referring to the actual methods, “artificial intelligence” and “machine learning”, are more appropriate.



Tags: Data mining, Monographs





Acosta Aguilera, M. E. "Minería de datos y descubrimiento del conocimiento." Info: Congreso Internacional de Información vol. 5, n. (2004).  pp.: http://www.congreso-info.cu/UserFiles/File/Info/Info2004/Ponencias/056.pdf



            Data mining is defined as a set of procedures, techniques and algorithms for extracting the relationships, patterns and hidden information contained in databases. The relationship between data mining and knowledge discovery is established and the steps of this discovery process are described. The tasks covered by data mining, the basic components of its models and the most widely used techniques and methods are discussed. Some of the problems or challenges that data mining must still overcome before it is fully adopted are analysed, and some of its applications are described. The conclusion is that, regardless of the complexity of the tool used, the use of data mining techniques benefits any organization with large databases.





Alejandra, S., V.-C. Christian, et al. "Using data mining techniques for exploring learning object repositories." The Electronic Library vol. 29, n. 2 (2011).  pp. 162-180. http://dx.doi.org/10.1108/02640471111125140



            Purpose – This paper aims to show the results obtained from the data mining techniques application to learning objects (LO) metadata. Design/methodology/approach – A general review of the literature was carried out. The authors gathered and pre-processed the data, and then analyzed the results of data mining techniques applied upon the LO metadata. Findings – It is possible to extract new knowledge based on learning objects stored in repositories. For example it is possible to identify distinctive features and group learning objects according to them. Semantic relationships can also be found among the attributes that describe learning objects. Research limitations/implications – In the first section, four test repositories are included for case study. In the second section, the analysis is focused on the most complete repository from the pedagogical point of view. Originality/value – Many publications report results of analysis on repositories mainly focused on the number, evolution and growth of the learning objects. But, there is a shortage of research using data mining techniques oriented to extract new semantic knowledge based on learning objects metadata.





Ana, K., D. Vladan, et al. "Using data mining to improve digital library services." The Electronic Library vol. 28, n. 6 (2010).  pp. 829-843. http://dx.doi.org/10.1108/02640471011093525



            Purpose – This paper aims to propose a solution for recommending digital library services based on data mining techniques (clustering and predictive classification). Design/methodology/approach – Data mining techniques are used to recommend digital library services based on the user's profile and search history. First, similar users were clustered together, based on their profiles and search behavior. Then predictive classification for recommending appropriate services to them was used. It has been shown that users in the same cluster have a high probability of accepting similar services or their patterns. Findings – The results indicate that k-means clustering and Naive Bayes classification may be used to improve the accuracy of service recommendation. The overall accuracy is satisfying, while average accuracy depends on the specific service. The results were better for frequently occurring services. Research limitations/implications – Datasets were used from the KOBSON digital library. Only clustering and predictive classification were applied. If the correlation between the service and the institution were higher, it would have better accuracy. Originality/value – The paper applied different and efficient data mining techniques for clustering digital library users based on their profiles and their search behavior, i.e. users' interaction with library services, and to obtain user patterns with respect to the library services they use. A digital library may apply this approach to offer appropriate services to new users more easily. The recommendations will be based on library items that similar users have already found useful.
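
A minimal sketch of the two-stage idea summarized above: first cluster users by profile and search behaviour, then use a Naive Bayes classifier to predict whether a service should be recommended. The toy profiles, the feature names and the use of scikit-learn's KMeans and GaussianNB are assumptions for illustration, not the KOBSON implementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

# Hypothetical user profiles: [searches per month, downloads per month, years registered]
profiles = np.array([[30, 5, 1], [28, 6, 2], [2, 0, 5], [3, 1, 6], [15, 10, 3], [14, 9, 4]])
accepted = np.array([1, 1, 0, 0, 1, 0])   # 1 = user accepted the offered service

# Stage 1: group similar users; the cluster id becomes an additional feature.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
X = np.column_stack([profiles, km.labels_])

# Stage 2: predictive classification of whether a service will be accepted.
clf = GaussianNB().fit(X, accepted)

new_user = np.array([[25, 4, 1]])
x_new = np.column_stack([new_user, km.predict(new_user)])
print(clf.predict(x_new))   # 1 would suggest recommending the service to this user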





Ananiadou, S. "The National Centre for Text Mining: a vision for the future." Ariadne vol., n. 53 (2007).  pp. np. http://www.ariadne.ac.uk/issue53/ananiadou/



            Describes the National Centre for Text Mining (NaCTeM) and the main scientific challenges it helps to solve together with issues related to deployment, use and uptake of NaCTeM's text mining tools and services. NaCTeM has developed a variety of text mining tools and services that offer numerous benefits to a wide range of users. These range from considerable reductions in time and effort for finding and linking pertinent information from large scale textual resources, to customised solutions in semantic data analysis and knowledge management. Enhancing metadata is one of the important benefits of deploying text mining services. TerMine (TM), a service for automatic term recognition, is being used for subject classification, creation of taxonomies, controlled vocabularies, ontology building and Semantic Web activities. As NaCTeM enters into its second phase, the goal is to improve levels of collaboration with Semantic Grid and Digital Library initiatives and contributions to bridging the gap between the library world and the e-Science world through an improved facility for constructing metadata descriptions from textual descriptions via TM. Adapted from the source document.





Ananiadou, S., J. Chruszcz, et al. "The National Centre for Text Mining: Aims and Objectives." Ariadne vol., n. 42 (2005).  pp.: http://www.ariadne.ac.uk/issue42/ananiadou/



            In this article we describe the role of the National Centre for Text Mining (NaCTeM). NaCTeM is operated by a consortium of three Universities: the University of Manchester which leads the consortium, the University of Liverpool and the University of Salford. The service activity is run by the National Centre for Dataset Services (MIMAS), based within Manchester Computing (MC). As part of previous and ongoing collaboration, NaCTeM involves, as self-funded partners, world-leading groups at San Diego Supercomputer Center (SDSC), the University of California at Berkeley (UCB), the University of Geneva and the University of Tokyo. NaCTeM’s initial focus is on bioscience and biomedical texts as there is an increasing need for bio-text mining and automated methods to search, access, extract, integrate and manage textual information from large-scale bio-resources. NaCTeM was established in Summer 2004 with funding from the Joint Information Systems Committee (JISC), the Biotechnology and Biological Sciences Research Council (BBSRC) and the Engineering and Physical Sciences Research Council (EPSRC), with the consortium itself investing almost the same amount as it received in funding.





Arakawa, Y., A. Kameda, et al. "Adding Twitter-specific features to stylistic features for classifying tweets by user type and number of retweets." Journal of the Association for Information Science and Technology vol. 65, n. 7 (2014).  pp. 1416-1423. http://dx.doi.org/10.1002/asi.23126



            Recently, Twitter has received much attention, both from the general public and researchers, as a new method of transmitting information. Among others, the number of retweets (RTs) and user types are the two important items of analysis for understanding the transmission of information on Twitter. To analyze this point, we applied text classification and feature extraction experiments using random forests machine learning with conventional stylistic and Twitter-specific features. We first collected tweets from 40 accounts with a high number of followers and created tweet texts from 28,756 tweets. We then conducted 15 types of classification experiments using a variety of combinations of features such as function words, speech terms, Twitter's descriptive grammar, and information roles. We deliberately observed the effects of features for classification performance. The results indicated that class classification per user indicated the best performance. Furthermore, we observed that certain features had a greater impact on classification. In the case of the experiments that assessed the level of RT quantity, information roles had an impact. In the case of user experiments, important features, such as the honorific postpositional particle and auxiliary verbs, such as “desu” and “masu,” had an impact. This research clarifies the features that are useful for categorizing tweets according to the number of RTs and user types.
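
The core of the experimental setup, a random forest trained on a mixture of stylistic and Twitter-specific features, can be sketched as follows; the feature set, the toy values and the binary high/low retweet labels are invented for illustration and are far simpler than the 15 experiment configurations the paper reports.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row mixes stylistic features (avg. sentence length, function-word rate)
# with Twitter-specific ones (hashtag count, mention count, URL count).
X = np.array([
    [12.0, 0.45, 0, 1, 0],
    [ 8.5, 0.30, 3, 0, 1],
    [15.2, 0.50, 0, 0, 0],
    [ 7.9, 0.28, 4, 2, 1],
    [11.1, 0.41, 1, 1, 0],
    [ 6.5, 0.25, 5, 3, 1],
])
y = np.array(["low_rt", "high_rt", "low_rt", "high_rt", "low_rt", "high_rt"])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Feature importances play the role of the paper's analysis of feature effects.
for name, weight in zip(["sent_len", "func_words", "hashtags", "mentions", "urls"],
                        clf.feature_importances_):
    print(f"{name}: {weight:.2f}")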





Baeza Yates, R. "Tendencias en minería de datos de la Web." El Profesional de la Información vol. 18, n. 1 (2009).  pp. 5-10. http://elprofesionaldelainformacion.metapress.com/media/9fppwlwxwpp61xvrlh87/contributions/3/7/5/7/3757882252861334.pdf



            Overview and trends of different aspects and applications of data mining on the Internet, in relation to Web 2.0, spam, analysis of searches, social networks and privacy.





Baeza-Yates, R. "Excavando la web." El Profesional de la Información vol. 13, n. 1 (2004).  pp.: http://elprofesionaldelainformacion.metapress.com/(j50ltv55nlwsbvu11bndbgmb)/app/home/journal.asp?referrer=parent&backto=homemainpublications,1,1 ;



            The Web is the most important phenomenon on the Internet, as shown by its exponential growth and diversity. Because of its volume and richness of data, web search engines have become one of its main tools. They are useful when we know what to look for. However, the Web surely holds many answers to questions never imagined. The process of discovering interesting relationships or patterns in a data set is called data mining, and in the case of the Web it is called web mining. In this article we present the main ideas in web mining and some of its applications.





Baeza-Yates, R., C. Hurtado, et al. "Improving search engines by query clustering." Journal of the American Society for Information Science and Technology vol. 58, n. 12 (2007).  pp.: http://www3.interscience.wiley.com/cgi-bin/jtoc/76501873/



            In this paper, we present a framework for clustering Web search engine queries whose aim is to identify groups of queries used to search for similar information on the Web. The framework is based on a novel term vector model of queries that integrates user selections and the content of selected documents extracted from the logs of a search engine. The query representation obtained allows us to treat query clustering similarly to standard document clustering. We study the application of the clustering framework to two problems: relevance ranking boosting and query recommendation. Finally, we evaluate with experiments the effectiveness of our approach.
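
A rough sketch of the underlying representation: each query is described by a term vector built from the documents users clicked for it, and standard document-clustering machinery is then applied. The toy click-through log and the use of TF-IDF plus k-means are assumptions for illustration, not the exact model of the paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical query log: each query is represented by the text of the
# documents users clicked for it, not by the query words themselves.
clicked_text = {
    "cheap flights":  "airline tickets low cost booking airport",
    "budget airfare": "low cost airline tickets deals booking",
    "python pandas":  "dataframe tutorial python data analysis",
    "numpy arrays":   "python numerical arrays tutorial broadcasting",
}

queries = list(clicked_text)
X = TfidfVectorizer().fit_transform([clicked_text[q] for q in queries])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for query, cluster in zip(queries, labels):
    print(cluster, query)   # queries in the same cluster are candidates for recommendation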





Bekhuis, T. "Conceptual biology, hypothesis discovery, and text mining: Swanson's legacy." Biomedical Digital Libraries vol. 3, n. 2 (2006).  pp.: http://www.bio-diglib.com/content/pdf/1742-5581-3-2.pdf



            Innovative biomedical librarians and information specialists who want to expand their roles as expert searchers need to know about profound changes in biology and parallel trends in text mining. In recent years, conceptual biology has emerged as a complement to empirical biology. This is partly in response to the availability of massive digital resources such as the network of databases for molecular biologists at the National Center for Biotechnology Information. Developments in text mining and hypothesis discovery systems based on the early work of Swanson, a mathematician and information scientist, are coincident with the emergence of conceptual biology. Very little has been written to introduce biomedical digital librarians to these new trends. In this paper, background for data and text mining, as well as for knowledge discovery in databases (KDD) and in text (KDT) is presented, then a brief review of Swanson's ideas, followed by a discussion of recent approaches to hypothesis discovery and testing. 'Testing' in the context of text mining involves partially automated methods for finding evidence in the literature to support hypothetical relationships. Concluding remarks follow regarding (a) the limits of current strategies for evaluation of hypothesis discovery systems and (b) the role of literature-based discovery in concert with empirical research. Report of an informatics-driven literature review for biomarkers of systemic lupus erythematosus is mentioned. Swanson's vision of the hidden value in the literature of science and, by extension, in biomedical digital databases, is still remarkably generative for information scientists, biologists, and physicians.
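
Swanson's ABC model, mentioned above, can be caricatured in a few lines: intermediate terms B that appear both in the literature about A and in the literature about C hint at a hidden hypothesis (his classic Raynaud's disease / fish oil example). The two-document "literatures" below are invented; real systems work over MEDLINE-scale corpora with far more careful term weighting.

from collections import Counter
import re

# Toy "literatures": titles mentioning A (Raynaud) and C (fish oil).
docs_a = ["raynaud disease linked to high blood viscosity",
          "platelet aggregation and vascular reactivity in raynaud patients"]
docs_c = ["dietary fish oil reduces blood viscosity",
          "fish oil lowers platelet aggregation in healthy subjects"]

def term_counts(docs):
    stop = {"and", "in", "to", "the", "of", "a"}
    return Counter(w for d in docs for w in re.findall(r"[a-z]+", d.lower()) if w not in stop)

a_terms, c_terms = term_counts(docs_a), term_counts(docs_c)
bridges = (set(a_terms) & set(c_terms)) - {"raynaud", "fish", "oil"}
print(sorted(bridges))   # -> ['aggregation', 'blood', 'platelet', 'viscosity']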





Bengtson, J. "Why I Can't Love the Homemade Semantic Web." B Sides vol., n. (2010).  pp.: http://ir.uiowa.edu/bsides/20



            Almost all information professionals agree that the web needs to move to a semantic structure. While work is proceeding in this area, movements to get individual web authors to use semantic markup tools have also been on the rise. This author argues that such efforts are ill conceived and he proposes an automated alternative.





Benoît, G. "Data Mining." Annual Review of Information Science and Technology (ARIST) vol. 36, n. (2002).  pp.: http://www3.interscience.wiley.com/cgi-bin/jissue/109883774



            Data mining (DM) is a multistaged process of extracting previously unanticipated knowledge from large databases, and applying the results to decision making. Data mining tools detect patterns from the data and infer associations and rules from them. The extracted information may then be applied to prediction or classification models by identifying relations within the data records or between databases. Those patterns and rules can then guide decision making and forecast the effects of those decisions. However, this definition may be applied equally to 'knowledge discovery in databases' (KDD). Indeed, in the recent literature of DM and KDD, a source of confusion has emerged, making it difficult to determine the exact parameters of both. KDD is sometimes viewed as the broader discipline, of which data mining is merely a component, specifically pattern extraction, evaluation, and cleansing methods (Raghavan, Deogun, & Sever, 1998, p. 397). Thurasingham (1999, p. 2) remarked that 'knowledge discovery,' 'pattern discovery,' 'data dredging,' 'information extraction,' and 'knowledge mining' are all employed as synonyms for DM. Trybula, in his ARIST chapter on text mining, observed that the 'existing work [in KDD] is confusing because the terminology is inconsistent and poorly defined'.





Blake, C. "Text Mining." Annual Review of Information Science and Technology (ARIST) vol. 45, n. (2011).  pp. 123-156.

            ARIST, published annually since 1966, is a landmark publication within the information science community. It surveys the landscape of information science and technology, providing an analytical, authoritative, and accessible overview of recent trends and significant developments. The range of topics varies considerably, reflecting the dynamism of the discipline and the diversity of theoretical and applied perspectives. While ARIST continues to cover key topics associated with "classical" information science (e.g., bibliometrics, information retrieval), the editor has selectively expanded its footprint in an effort to connect information science more tightly with cognate academic and professional communities.





Candás Romero, J. "Minería de datos en bibliotecas: bibliominería." BiD: textos universitaris de biblioteconomia i documentació vol., n. 17 (2006).  pp.: http://www2.ub.edu/bid/consulta_articulos.php?fichero=17canda2.htm



            A theoretical introduction to the application of data mining in libraries, known as bibliomining ("bibliominería" is proposed as the Spanish term for the English bibliomining). Some of its possible practical applications are presented, along with how they support the so-called Library 2.0 and the creation and management of services that are more and better oriented to the user and based on new technologies. Finally, the problem of privacy in the application of bibliomining is analysed.





Capuano, E. A. "O poder cognitivo das redes neurais artificiais modelo ART1 na recuperação da informação." Ciência da Informação vol. 38, n. 1 (2009).  pp. 9-30. http://www.scielo.br/pdf/ci/v38n1/01.pdf



            This article reports an experiment with a computational simulation of an Information Retrieval System constituted of a textual indexing base from a sample of documents, an artificial neural network software implementing Adaptive Resonance Theory concepts for the process of ordering and presenting outputs, and a human user interacting with the system in query processing. The goal of the experiment was to demonstrate (i) the usefulness of Carpenter and Grossberg (1988) neural networks based on that theory, and (ii) the power of semantic resolution based on sintagmatic indexing of the SiRILiCO approach proposed by Gottschalg-Duque (2005), for whom a noun phrase or proposition is a linguistic unity constituted of meaning larger than a word meaning and smaller than a story telling or a theory meaning. The experiment demonstrated the effectiveness and efficiency of an Information Retrieval System joining together those resources, and the conclusion is that such computational environment will be capable of dynamic and on-line clustering with continuing inputs and learning in a non-supervised fashion, without batch training needs (off-line), to answer user queries in computer networks with promising performance. Adapted from the source document.





Chan-Chine, C. and C. Ruey-Shun "Using data mining technology to solve classification problems: A case study of campus digital library." The Electronic Library vol. 24, n. 3 (2006).  pp.: http://www.emeraldinsight.com/Insight/ViewContentServlet?Filename=Published/EmeraldFullTextArticle/Articles/2630240303.html



            Traditional library catalogs have become inefficient and inconvenient in assisting library users. Readers may spend a lot of time searching library materials via printed catalogs. Readers need an intelligent and innovative solution to overcome this problem. The paper seeks to examine data mining technology, which is a good approach to fulfill readers' requirements. Design/methodology/approach – Data mining is considered to be the non-trivial extraction of implicit, previously unknown, and potentially useful information from data. This paper analyzes readers' borrowing records using the techniques of data analysis, building a data warehouse, and data mining. Findings – The paper finds that after mining data, readers can be classified into different groups according to the publications in which they are interested. Some people on the campus also have a greater preference for multimedia data. Originality/value – The data mining results show that all readers can be categorized into five clusters, and each cluster has its own characteristics. The frequency with which graduates and associate researchers borrow multimedia data is much higher. This phenomenon shows that these readers have a higher preference for accepting digitized publications. Also, the number of readers borrowing multimedia data has increased over the years. This trend indicates that readers' preferences are gradually shifting towards reading digital publications.





Chaves Ramos, H. d. S. and M. Brascher "Aplicação da descoberta de conhecimento em textos para apoio à construção de indicadores infométricos para a área de C&T." Ciência da Informação vol. 38, n. 2 (2009).  pp. 56-68. http://www.scielo.br/pdf/ci/v38n2/05.pdf



            This article describes the results of a research applying Knowledge Discovery in Texts (KDT) in textual contents, which are important sources of information for decision-making purposes. The main objective of the research is to verify the effectiveness of KDT for discovering information that may support the construction of ST&I indicators and for the definition of public policies. The case study of the research was the textual content of the Brazilian Service for Technical Answers (Servico Brasileiro de Respostas Tecnicas -- SBRT) and the technique adopted was document clustering from terms mined in the database. The use of DCT for extracting hidden information -- that could not be found by using the traditional information retrieval -- from textual documents proved to be efficient. The presence of environmental concerns in the demand posted by SBRT's users and the applicability of DCT to orient internal policies for SBRT network were also evidenced by the research results. Adapted from the source document.





Chen, C.-L., F. S. C. Tseng, et al. "Mining fuzzy frequent itemsets for hierarchical document clustering." Information Processing & Management vol. 46, n. 2 (2010).  pp. 193-211. http://www.sciencedirect.com/science/article/B6VC8-4XK9J7J-1/2/733c0f885a05224f80d3f0ac97148e41



            As text documents are explosively increasing in the Internet, the process of hierarchical document clustering has been proven to be useful for grouping similar documents for versatile applications. However, most document clustering methods still suffer from challenges in dealing with the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels. In this paper, we will present an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F2IHC) approach, which uses fuzzy association rule mining algorithm to improve the clustering accuracy of Frequent Itemset-Based Hierarchical Clustering (FIHC) method. In our approach, the key terms will be extracted from the document set, and each document is pre-processed into the designated representation for the following mining process. Then, a fuzzy association rule mining algorithm for text is employed to discover a set of highly-related fuzzy frequent itemsets, which contain key terms to be regarded as the labels of the candidate clusters. Finally, these documents will be clustered into a hierarchical cluster tree by referring to these candidate clusters. We have conducted experiments to evaluate the performance based on Classic4, Hitech, Re0, Reuters, and Wap datasets. The experimental results show that our approach not only absolutely retains the merits of FIHC, but also improves the accuracy quality of FIHC.
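
The basic (non-fuzzy) intuition behind itemset-based document clustering can be sketched briefly: frequent combinations of key terms become candidate clusters and, at the same time, readable cluster labels. The toy term sets and the minimum support threshold below are assumptions; the paper's F2IHC adds fuzzy association rule mining on top of this idea.

from itertools import combinations
from collections import Counter

# Toy documents reduced to sets of key terms.
docs = [
    {"data", "mining", "cluster"},
    {"data", "mining", "classification"},
    {"neural", "network", "training"},
    {"neural", "network", "deep"},
]
min_support = 2

counts = Counter()
for terms in docs:
    for size in (1, 2):
        counts.update(combinations(sorted(terms), size))

frequent = [items for items, n in counts.items() if n >= min_support]

# Each frequent itemset labels a candidate cluster: the documents containing all its terms.
for items in frequent:
    members = [i for i, terms in enumerate(docs) if set(items) <= terms]
    print(items, "->", members)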





Chen, Y.-L., Y.-H. Liu, et al. "A text mining approach to assist the general public in the retrieval of legal documents." Journal of the American Society for Information Science and Technology vol. 64, n. 2 (2013).  pp. 280-290. http://dx.doi.org/10.1002/asi.22767



            Applying text mining techniques to legal issues has been an emerging research topic in recent years. Although some previous studies focused on assisting professionals in the retrieval of related legal documents, they did not take into account the general public and their difficulty in describing legal problems in professional legal terms. Because this problem has not been addressed by previous research, this study aims to design a text-mining-based method that allows the general public to use everyday vocabulary to search for and retrieve criminal judgments. The experimental results indicate that our method can help the general public, who are not familiar with professional legal terms, to acquire relevant criminal judgments more accurately and effectively.





Clare, T. "Advances in Information Retrieval." Journal of Documentation vol. 68, n. 5 (2012).  pp.: http://www.emeraldinsight.com/journals.htm?articleid=17051071



            This publication provides an excellent “state of the art” review and description of recent developments and improvements within information retrieval (IR). It is very broad ranging in its coverage and the contributions are organised under the following headings: natural language processing (NLP) and text mining; web IR; evaluation; multi media IR; distributed IR and performance issues; IR theory and formal models; personalisation and recommendation; domain specific IR and cross language IR; user issues. In addition to 44 revised full papers, plus the keynote address, it includes abstracts of invited talks on emerging issues including collaborative web searching, the impact of visualisation technology on NLP and developments in automatic image annotation for multimedia IR. A fuller description of these talks or a clear reference to relevant papers would have been useful. Finally it includes posters and a description of some demonstrations of new IR systems.





Cobo, A., R. Rocha, et al. "Gestão da informação em ambientes globais: computação bio-inspirada em repositórios de documentos econômicos multilingues." Informação & Sociedade: Estudos vol. 23, n. 1 (2013).  pp.: http://periodicos.ufpb.br/ojs/index.php/ies/article/view/15128



            The information is a strategic resource of first order for organizations, so it is essential to have methodologies and tools that allow them to properly manage information and extract knowledge from it. Organizations also need knowledge generation strategies using unstructured textual information from different sources and in different languages. This paper presents two bio-inspired approaches to clustering multilingual document collections in a particular field (economics and business). This problem is quite significant and necessary to organize the huge volume of information managed within organisations in a global context characterised by the intensive use of Information and Communication Technologies. The proposed clustering algorithms take inspiration from the behaviour of real ant colonies and can be applied to identify groups of related multilingual documents in the field of economics and business. In order to obtain a language independent vector representation, several linguistic resources and tools are used. The performance of the algorithms is analysed using a corpus of 250 documents in Spanish and English from different functional areas of the enterprise, and experimental results are presented. The results demonstrate the usefulness and effectiveness of the algorithms as clustering technique.





Cobo Ortega, A., R. Rocha Blanco, et al. "Descubrimiento de conocimiento en repositorios documentales mediante técnicas de Minería de Texto y Swarm Intelligence." Rect@ : Revista Electrónica de Comunicaciones y Trabajos de ASEPUMA vol., n. 10 (2009).  pp. 105-124. http://dialnet.unirioja.es/servlet/extart?codigo=3267050



            The combined use of text mining methodologies and Artificial Intelligence techniques supports document management processes and optimizes the mechanisms for categorization, automatic knowledge extraction and grouping of document collections. The article proposes an integral document management model for processing unstructured information. In this model, specialized thesauri and glossaries are used to establish semantic relationships between terms, and Swarm Intelligence techniques are used for knowledge extraction. The model has been implemented in an intuitive, multilingual application that integrates text mining techniques.





Cui, H. "Competency evaluation of plant character ontologies against domain literature." Journal of the American Society for Information Science and Technology vol. 61, n. 6 (2010).  pp. n/a. http://doi.wiley.com/10.1002/asi.21325



            Specimen identification keys are still the most commonly created tools used by systematic biologists to access biodiversity information. Creating identification keys requires analyzing and synthesizing large amounts of information from specimens and their descriptions and is a very labor-intensive and time-consuming activity. Automating the generation of identification keys from text descriptions becomes a highly attractive text mining application in the biodiversity domain. Fine-grained semantic annotation of morphological descriptions of organisms is a necessary first step in generating keys from text. Machine-readable ontologies are needed in this process because most biological characters are only implied (i.e., not stated) in descriptions. The immediate question to ask is How well do existing ontologies support semantic annotation and automated key generation? With the intention to either select an existing ontology or develop a unified ontology based on existing ones, this paper evaluates the coverage, semantic consistency, and inter-ontology agreement of a biodiversity character ontology and three plant glossaries that may be turned into ontologies. The coverage and semantic consistency of the ontology/glossaries are checked against the authoritative domain literature, namely, Flora of North America and Flora of China. The evaluation results suggest that more work is needed to improve the coverage and interoperability of the ontology/glossaries. More concepts need to be added to the ontology/glossaries and careful work is needed to improve the semantic consistency. The method used in this paper to evaluate the ontology/glossaries can be used to propose new candidate concepts from the domain literature and suggest appropriate definitions.





de la Puente, M. "Gestión del conocimiento y minería de datos." E-LIS: E-Prints in Library and Information Science vol., n. (2010).  pp.: http://www.ccinfo.com.ar/documentos_trabajo/DT_019.pdf



            Knowledge Management refers to the set of processes developed in an organization to create, organize, store and transfer knowledge. Data Mining is the discipline whose objective is the extraction of the knowledge implicit in large databases. Data Mining plays a fundamental role in turning implicit knowledge into explicit knowledge and in the different stages of the Knowledge Management process in organizations.





Del-Fresno-García, M. "Infosociabilidad: monitorización e investigación en la web 2.0 para la toma de decisiones." El Profesional de la Información vol. 20, n. 5 (2011).  pp. 548 - 554. http://eprints.rclis.org/bitstream/10760/16150/1/Miguel-Del-Fresno-Infosociabilidad-reputacion-Online.pdf



            This methodology offers an approach to studying the information available within Web 2.0 Media and User-Generated Content (MUGC). The large-scale generation of online information is the result of collective social action based on information: Infosociability. Competitive Intelligence (CI) aims to monitor and research a company’s web 2.0 environment for information relevant to its decision-making process. Facing the possibilities and limitations that today’s technology offers for processing the communication of meanings and abstract ideas in text format, a methodology derived from empirical research on web 2.0 is proposed. Monitoring and research are identified as the two key processes that generate insights aimed to facilitate decision-making. The relevance of each stage is illustrated with reference to the diverse methodological challenges encountered while extracting and analyzing large amounts of online information.





Eíto Brun, R. and J. A. Senso "Minería textual." El Profesional de la Información vol. 13, n. 1 (2004).  pp.: http://www.elprofesionaldelainformacion.com/contenidos/2004/enero/2.pdf



            This article attempts to establish a definition for 'text mining' and, at the same time, to identify its relationship with other fields: text retrieval, data mining and computational linguistics. In addition, there is an analysis of the impact of text mining, a reference to existing commercial applications on the market and, lastly, a brief description of the techniques used for developing and implementing text mining systems.





Morgan, E. L. "Use and understand: the inclusion of services against texts in library catalogs and “discovery systems”." Library Hi Tech vol. 30, n. 1 (2012).  pp. 35-59. http://dx.doi.org/10.1108/07378831211213201



            Purpose – The purpose of this article is to outline possibilities for the integration of text mining and other digital humanities computing techniques into library catalogs and “discovery systems”. Design/methodology/approach – The approach has been to survey existing text mining apparatus and apply this to traditional library systems. Findings – Through this process it is found that there are many ways library interfaces can be augmented to go beyond the processes of find and get and evolve to include processes of use and understand. Originality/value – To the best of the author's knowledge, this type of augmentation has yet to be suggested or implemented across libraries.





Escorsa, P. and R. Maspons "Los mapas tecnológicos." De la vigilancia tecnológica a la inteligencia competitiva vol., n. 5 (2001).  pp.: http://148.216.10.83/VIGILANCIA/capitulo_5.htm



            Under the generic title of Technology Maps, this chapter brings together several topics of great importance for business intelligence, such as the construction of the maps themselves and the relationship between products and/or technologies and markets, for which a matrix is proposed to help discover opportunities, although it must be noted that this method is still at a very early stage. Finally, data mining, increasingly used in companies, is briefly introduced, even though it relies on techniques such as neural networks or decision trees, which are not described in the previous chapter.





Febles Rodríguez, J. P. and A. González Pérez "Aplicación de la minería de datos en la bioinformática." Acimed: revista cubana de los profesionales de la información y la comunicación en salud vol. 10, n. 2 (2002).  pp.: http://bvs.sld.cu/revistas/aci/vol10_2_02/aci03202.htm



            In the coming years the biomedical sciences will advance spectacularly as a result of the Human Genome Project. New technologies based on molecular genetics and informatics are key to this development, since they provide powerful instruments for obtaining and analysing genetic information. The emergence of new technologies has made the development of genomics possible by facilitating the study of gene interactions and their influence on the development of diseases, all of which affects clinical diagnosis, the research of new drugs, epidemiology and medical informatics. In recent years, data mining has boomed as a support for information and knowledge management philosophies, as well as for discovering the meaning of the data stored in large repositories.





Firestone, J. M. "Mining for information gold." Information Management Journal vol. 39, n. 5 (2005).  pp. 47-50, 52.

            Discusses the concept of data mining and its value for records and information management (RIM) professionals in enhancing the quality of information. Shows how to get started in data mining and considers some of the concerns about the technique, including assuring the quality of the data mined in terms of currency, completeness and accuracy. Attempts to predict the future for the technique and suggests that this might lie in the direction of innovation currently being undertaken at university laboratories, the increasing popularity of open analytical platforms from vendors, the integration of business intelligence and data mining technologies, and the continuing development of intelligent agents and distributed knowledge processing for processing the mass of information becoming available. (Quotes from original text)





Fox, L. M., L. A. Williams, et al. "Negotiating a Text Mining License for Faculty Researchers." Information Technology and Libraries vol. 33, n. 3 (2014).  pp. 5-21. http://ejournals.bc.edu/ojs/index.php/ital/article/view/5485



            This case study examines strategies used to leverage the library’s existing journal licenses to obtain a large collection of full-text journal articles in extensible markup language (XML) format; the right to text mine the collection; and the right to use the collection and the data mined from it for grant-funded research to develop biomedical natural language processing (BNLP) tools. Researchers attempted to obtain content directly from PubMed Central (PMC). This attempt failed due to limits on use of content in PMC. Next researchers and their library liaison attempted to obtain content from contacts in the technical divisions of the publishing industry. This resulted in an incomplete research data set. Then researchers, the library liaison, and the acquisitions librarian collaborated with the sales and technical staff of a major science, technology, engineering, and medical (STEM) publisher to successfully create a method for obtaining XML content as an extension of the library’s typical acquisition process for electronic resources. Our experience led us to realize that text mining rights of full-text articles in XML format should routinely be included in the negotiation of the library’s licenses.





Franganillo, J. "Implicaciones éticas de la minería de datos." Anuario ThinkEPI vol., n. (2010).  pp.: http://www.thinkepi.net/implicaciones-eticas-de-la-mineria-de-datos



            Certain experts can describe the behaviour of a group of people based on the digital records of what they do. The description is detailed: what they do, what they buy, how they work, who they interact with. This is data mining, which is usually used to discriminate positively: knowing, for example, the purchasing habits of a given group makes it possible to target an advertising campaign at them more effectively. But it can also be used to discriminate negatively: analysing the e-mail logs of a company's employees makes it possible to identify those who are feeding informal networks and, as a consequence, managers could change their attitude towards them. One study observes that people who buy red cars in France are more likely to default on their loans (Chakrabarti, 2008): this could change the credit conditions of those who choose red for their car. People tend to be classified according to stereotypes based on statistical correlations, but these carry the errors of any generalization, and so some end up paying for others.





Fu Lee, W. and C. C. Yang "Mining Web data for Chinese segmentation." Journal of the American Society for Information Science and Technology vol. 58, n. 12 (2007).  pp.: http://www3.interscience.wiley.com/cgi-bin/jtoc/76501873/



            Modern information retrieval systems use keywords within documents as indexing terms for search of relevant documents. As Chinese is an ideographic character-based language, the words in the texts are not delimited by white spaces. Indexing of Chinese documents is impossible without a proper segmentation algorithm. Many Chinese segmentation algorithms have been proposed in the past. Traditional segmentation algorithms cannot operate without a large dictionary or a large corpus of training data. Nowadays, the Web has become the largest corpus that is ideal for Chinese segmentation. Although most search engines have problems in segmenting texts into proper words, they maintain huge databases of documents and frequencies of character sequences in the documents. Their databases are important potential resources for segmentation. In this paper, we propose a segmentation algorithm by mining Web data with the help of search engines. On the other hand, the Romanized pinyin of Chinese language indicates boundaries of words in the text. Our algorithm is the first to utilize the Romanized pinyin for segmentation. It is the first unified segmentation algorithm for the Chinese language from different geographical areas, and it is also domain independent because of the nature of the Web. Experiments have been conducted on the datasets of a recent Chinese segmentation competition. The results show that our algorithm outperforms the traditional algorithms in terms of precision and recall. Moreover, our algorithm can effectively deal with the problems of segmentation ambiguity, new word (unknown word) detection, and stop words.
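
The flavour of frequency-driven segmentation can be sketched with a tiny dynamic programme that chooses the split maximizing the product of word probabilities; the small frequency table stands in for the Web-derived character-sequence counts the authors mine, and the example sentence and counts are invented.

import math

freq = {"我": 50, "喜欢": 30, "喜": 5, "欢": 5, "北京": 40, "北": 8, "京": 8, "烤鸭": 20}
total = sum(freq.values())

def segment(text):
    # best[i] = (log-probability, word list) of the best segmentation of text[:i]
    best = [(0.0, [])]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - 4), i):          # consider words up to 4 characters long
            word = text[j:i]
            logp = math.log(freq.get(word, 0.5) / total)   # unseen strings get a tiny pseudo-count
            candidates.append((best[j][0] + logp, best[j][1] + [word]))
        best.append(max(candidates))
    return best[-1][1]

print(segment("我喜欢北京烤鸭"))   # expected: ['我', '喜欢', '北京', '烤鸭']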





Fu, T., A. Abbasi, et al. "A focused crawler for Dark Web forums." Journal of the American Society for Information Science and Technology vol. 61, n. 6 (2010).  pp. 1213 - 1231. http://doi.wiley.com/10.1002/asi.21323



            The unprecedented growth of the Internet has given rise to the Dark Web, the problematic facet of the Web associated with cybercrime, hate, and extremism. Despite the need for tools to collect and analyze Dark Web forums, the covert nature of this part of the Internet makes traditional Web crawling techniques insufficient for capturing such content. In this study, we propose a novel crawling system designed to collect Dark Web forum content. The system uses a human-assisted accessibility approach to gain access to Dark Web forums. Several URL ordering features and techniques enable efficient extraction of forum postings. The system also includes an incremental crawler coupled with a recall-improvement mechanism intended to facilitate enhanced retrieval and updating of collected content. Experiments conducted to evaluate the effectiveness of the human-assisted accessibility approach and the recall-improvement-based, incremental-update procedure yielded favorable results. The human-assisted approach significantly improved access to Dark Web forums while the incremental crawler with recall improvement also outperformed standard periodic- and incremental-update approaches. Using the system, we were able to collect over 100 Dark Web forums from three regions. A case study encompassing link and content analysis of collected forums was used to illustrate the value and importance of gathering and analyzing content from such online communities.





Gálvez, C. "Minería de textos: la nueva generación de análisis de literatura científica en biología molecular y genómica." Departamento de Ciência da Informaç¦o, Universidade Federal de Santa Catarina (Brasil) vol., n. (2008).  pp.: http://eprints.rclis.org/13361/



            Now that the human genome sequence has been deciphered, the research paradigm has shifted towards describing the functions of genes and towards future advances in the fight against disease. This new context has awakened the interest of Bioinformatics, which combines methods from the Life Sciences with the Information Sciences, making it possible to access the enormous amount of biological information stored in databases, and of Genomics, devoted to the study of gene interactions and their influence on the development of diseases. In this context, text mining emerges as a new instrument for the analysis of the scientific literature. A common text mining task in Molecular Biology and Genomics is the recognition of biological entities, such as genes, proteins and diseases. The next step in the mining process is the identification of relationships between biological entities, such as the type of gene-gene, gene-disease or gene-protein interaction, in order to interpret biological functions or to formulate research hypotheses. The aim of this paper is to examine the rise, and the limitations, of this new generation of tools for the analysis of natural-language information stored in bibliographic databases such as PubMed or MEDLINE.





Gálvez, C. and F. Moya-Anegón "Text-mining research in genomics." International Association for Development of the Information Society (IADIS) vol., n. (2008).  pp. 277-283. http://www.computing-conf.org/



            Biomedical text-mining holds great promise for genomic researchers. The goal of text-mining is to analyze large collections of unstructured documents for the purposes of extracting interesting and non-trivial patterns of knowledge. The analysis of biomedical texts and available databases, such as Medline and PubMed, can help to interpret a phenomenon, to detect gene relations, or to establish comparisons among similar genes in different specific databases. All these processes are crucial for making sense of the immense quantity of genomic information. In genomics, text-mining research refers basically to the creation of literature networks of related biological entities. Text data represent the genomics knowledge base and can be mined for relationships, literature networks, and new discoveries by literature relational chaining. However, text-mining is an emerging field without a clear definition in genomics. This work presents some applications of text-mining to genome-based research, such as genomic term identification in curation processes, the formulation of hypotheses about disease, the visualization of biological relationships, or life-science domain mapping.





Gómez Aguilar, D. A., F. J. García Peñalvo, et al. "Analítica visual en e-learning [Visual analytics in e-learning]." El Profesional de la Información vol. 23, n. 3 (2014).  pp. 236-245. http://elprofesionaldelainformacion.metapress.com/app/home/contribution.asp?referrer=parent&backto=issue,3,13;journal,2,96;homemainpublications,1,1 ;



            The technologies used in learning processes record all the activities carried out. These data can be exploited for the assessment of students, teachers and the processes themselves. However, even though this large amount of data exists, it is still difficult for teachers (and other stakeholders) to verify hypotheses, draw conclusions or make decisions based on detected facts or situations. A model for the analysis of educational data is presented, based on visual analytics, learning analytics and academic analytics. By means of a software tool it allows exploratory and confirmatory data analysis, interacting with the information obtained from a typical learning management system. The main goal is the discovery of new knowledge about the educational learning process which, in turn, makes it possible to improve it. (A.)





Haravu, L. J. and A. Neelameghan "Text Mining and Data Mining in Knowledge Organization and Discovery: The Making of Knowledge-Based Products." Cataloging & classification quarterly vol. 37, n. 1-2 (2003).  pp.: https://www.haworthpress.com/store/ArticleAbstract.asp?sid=82ATF0VJW1QK8MGFETSH163PXESHFAM9&ID=40765



            Discusses the importance of knowledge organization in the context of the information overload caused by the vast quantities of data and information accessible on internal and external networks of an organization. Defines the characteristics of a knowledge-based product. Elaborates on the techniques and applications of text mining in developing knowledge products. Presents two approaches, as case studies, to the making of knowledge products: (1) steps and processes in the planning, designing and development of a composite multilingual multimedia CD product, with the potential international, inter-cultural end users in view, and (2) application of natural language processing software in text mining. Using a text mining software, it is possible to link concept terms from a processed text to a related thesaurus, glossary, schedules of a classification scheme, and facet structured subject representations. Concludes that the products of text mining and data mining could be made more useful if the features of a faceted scheme for subject classification are incorporated into text mining techniques and products.





He, Y. L. and S. C. Hui "Mining a Web Citation Database for Author Co-Citation Analysis." Information Processing & Management vol. 38, n. 4 (2002).  pp.: http://www.sciencedirect.com/science/journal/03064573



            Author co-citation analysis (ACA) has been widely used in bibliometrics as an analytical method in analyzing the intellectual structure of science studies. It can be used to identify authors from the same or similar research fields. However, such an analysis method relies heavily on statistical tools to perform the analysis and requires human interpretation. The Web Citation Database is a data warehouse used for storing citation indices of Web publications. In this paper, we propose a mining process to automate the ACA based on the Web Citation Database. The mining process uses agglomerative hierarchical clustering (AHC) as the mining technique for author clustering and multidimensional scaling (MDS) for displaying author cluster maps. The clustering results and author cluster map have been incorporated into a citation-based retrieval system known as PubSearch to support author retrieval of Web publications.
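
The analytical pipeline described (author co-citation counts, agglomerative hierarchical clustering, multidimensional scaling for the cluster map) can be sketched with scikit-learn; the author names and the co-citation counts below are invented for illustration.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import MDS

authors = ["Salton", "van Rijsbergen", "Garfield", "Small", "Price"]
# Hypothetical symmetric co-citation counts (how often two authors are cited together).
cocitation = np.array([
    [ 0, 12,  2,  1,  1],
    [12,  0,  1,  2,  1],
    [ 2,  1,  0, 10,  8],
    [ 1,  2, 10,  0,  9],
    [ 1,  1,  8,  9,  0],
])

# Turn similarity into a distance matrix for clustering and scaling.
dist = cocitation.max() - cocitation
np.fill_diagonal(dist, 0)

# Older scikit-learn versions call the "metric" parameter "affinity".
labels = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                 linkage="average").fit_predict(dist)
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)

for author, label, (x, y) in zip(authors, labels, coords):
    print(f"{author:15s} cluster={label} map=({x:.2f}, {y:.2f})")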





Heinrichs, J. H. and J.-S. Lim "Integrating web-based data mining tools with business models for knowledge management." Decision Support Systems vol. 35, n. 1 (2003).  pp. 103-112. http://www.sciencedirect.com/science/article/B6V8S-45X0BS7-3/2/2134b5fcc7cf3c9d9ac56149e96489ef



            As firms begin to implement web-based presentation and data mining tools to enhance decision support capability, the firm's knowledge workers must determine how to most effectively use these new web-based tools to deliver competitive advantage. The focus of this study is on evaluating how knowledge workers integrate these tools into their information and knowledge management requirements. The relationship between the independent variables (web-based data mining software tools and business models) and the dependent variable (strategic performance capabilities) is empirically tested in this study. The results from this study demonstrate the positive interaction effect between the tools and models application on strategic performance capability.





Heneberg, P. "Supposedly uncited articles of Nobel laureates and Fields medalists can be prevalently attributed to the errors of omission and commission." Journal of the American Society for Information Science and Technology vol. 64, n. 3 (2013).  pp. 448-454. http://dx.doi.org/10.1002/asi.22788



            Several independent authors reported a high share of uncited publications, which include those produced by top scientists. This share was repeatedly reported to exceed 10% of the total papers produced, without any explanation of this phenomenon and the lack of difference in uncitedness between average and successful researchers. In this report, we analyze the uncitedness among two independent groups of highly visible scientists (mathematicians represented by Fields medalists, and researchers in physiology or medicine represented by Nobel Prize laureates in the respective field). Analysis of both groups led to the identical conclusion: over 90% of the uncited database records of highly visible scientists can be explained by the inclusion of editorial materials progress reports presented at international meetings (meeting abstracts), discussion items (letters to the editor, discussion), personalia (biographic items), and by errors of omission and commission of the Web of Science (WoS) database and of the citing documents. Only a marginal amount of original articles and reviews were found to be uncited (0.9 and 0.3%, respectively), which is in strong contrast with the previously reported data, which never addressed the document types among the uncited records.





Chen, H. "Introduction to the JASIST Special Topic Section on Web Retrieval and Mining: A Machine Learning Perspective." Journal of the American Society for Information Science and Technology vol. 54, n. 7 (2002).  pp.:

            This special issue consists of six papers that report research in web retrieval and mining. Most papers apply or adapt various pre-web retrieval and analysis techniques to other interesting and challenging web-based applications.





Chen, H. and M. Chau "Web mining: machine learning for web applications." Annual Review of Information Science and Technology (ARIST) vol. 38, n. (2004).  pp.: http://www3.interscience.wiley.com/cgi-bin/fulltext/111091572/PDFSTART



            With more than two billion pages created by millions of Web page authors and organizations, the World Wide Web is a tremendously rich knowledge base. The knowledge comes not only from the content of the pages themselves, but also from the unique characteristics of the Web, such as its hyperlink structure and its diversity of content and languages. Analysis of these characteristics often reveals interesting patterns and new knowledge. Such knowledge can be used to improve users' efficiency and effectiveness in searching for information on the Web, and also for applications unrelated to the Web, such as support for decision making or business management.





Huang, C., T. Fu, et al. "Text-based video content classification for online video-sharing sites." Journal of the American Society for Information Science and Technology vol. 61, n. 5 (2010).  pp. 891-906. http://dx.doi.org/10.1002/asi.21291



            With the emergence of Web 2.0, sharing personal content, communicating ideas, and interacting with other online users in Web 2.0 communities have become daily routines for online users. User-generated data from Web 2.0 sites provide rich personal information (e.g., personal preferences and interests) and can be utilized to obtain insight about cyber communities and their social networks. Many studies have focused on leveraging user-generated information to analyze blogs and forums, but few studies have applied this approach to video-sharing Web sites. In this study, we propose a text-based framework for video content classification of online-video sharing Web sites. Different types of user-generated data (e.g., titles, descriptions, and comments) were used as proxies for online videos, and three types of text features (lexical, syntactic, and content-specific features) were extracted. Three feature-based classification techniques (C4.5, Naïve Bayes, and Support Vector Machine) were used to classify videos. To evaluate the proposed framework, user-generated data from candidate videos, which were identified by searching user-given keywords on YouTube, were first collected. Then, a subset of the collected data was randomly selected and manually tagged by users as our experiment data. The experimental results showed that the proposed approach was able to classify online videos based on users' interests with accuracy rates up to 87.2%, and all three types of text features contributed to discriminating videos. Support Vector Machine outperformed C4.5 and Naïve Bayes techniques in our experiments. In addition, our case study further demonstrated that accurate video-classification results are very useful for identifying implicit cyber communities on video-sharing Web sites.
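
A compact sketch of the text-based approach: user-generated text stands in for the video itself and a standard bag-of-words classifier does the rest. The toy snippets, the two interest categories and the choice of a linear SVM (one of the three classifiers the paper compares) are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Titles, descriptions and comments act as proxies for the videos; values are invented.
texts = [
    "amazing goal last minute football highlights",
    "full match review league football tactics",
    "guitar lesson beginner chords tutorial",
    "how to play blues guitar solo lesson",
]
labels = ["sports", "sports", "music", "music"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(texts, labels)
print(clf.predict(["best football goals compilation"]))   # expected: ['sports']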





Hwang, S.-Y., W.-S. Yang, et al. "Automatic index construction for multimedia digital libraries." Information Processing & Management vol. 46, n. 3 (2010).  pp. 295-307. http://www.sciencedirect.com/science/article/B6VC8-4XM6NHT-2/2/50b0f024e70987516fe1fba5a5637955



            Indexing remains one of the most popular tools provided by digital libraries to help users identify and understand the characteristics of the information they need. Despite extensive studies of the problem of automatic index construction for text-based digital libraries, index construction for multimedia digital libraries continues to represent a challenge, because multimedia objects usually lack sufficient text information to ensure reliable index learning. This research attempts to tackle the problem of automatic index construction for multimedia objects by employing Web usage logs and the limited keywords pertaining to multimedia objects. Two proposed algorithms are tested on two data sets with different amounts of textual information. Web usage logs offer valuable information for building indexes of multimedia digital libraries with limited textual information. The proposed methods generally yield better indexes, especially for the artwork data set.
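
As a rough illustration of the usage-log idea, the sketch below assigns candidate index terms to a multimedia object from the search queries that led users to it. The log records, object identifiers, and term-selection rule are invented for illustration and do not reproduce the paper's algorithms.

# Simplified sketch: derive index terms for a multimedia object from the search
# queries in Web usage logs that led users to it. Log records are invented.
from collections import Counter

log = [  # (query, object_id) pairs extracted from a hypothetical usage log
    ("impressionist painting", "artwork_17"),
    ("monet water lilies", "artwork_17"),
    ("impressionist painting", "artwork_17"),
    ("baroque sculpture", "artwork_03"),
]

term_counts = {}
for query, obj in log:
    term_counts.setdefault(obj, Counter()).update(query.split())

for obj, counter in term_counts.items():
    # Keep the most frequent query terms as candidate index terms.
    print(obj, "->", [term for term, _ in counter.most_common(3)])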





Jia, L., Z. Pengzhu, et al. "External concept support for group support systems through Web mining." Journal of the American Society for Information Science and Technology vol. 60, n. 5 (2009).  pp. 1057. http://proquest.umi.com/pqdweb?did=1682801381&Fmt=7&clientId=40776&RQT=309&VName=PQD



            External information plays an important role in group decision-making processes, yet research about external information support for Group Support Systems (GSS) has been lacking. In this study, we propose an approach to build a concept space to provide external concept support for GSS users. Built on a Web mining algorithm, the approach can mine a concept space from the Web and retrieve related concepts from the concept space based on users' comments in a real-time manner. We conduct two experiments to evaluate the quality of the proposed approach and the effectiveness of the external concept support provided by this approach. The experiment results indicate that the concept space mined from the Web contained qualified concepts to stimulate divergent thinking. The results also demonstrate that external concept support in GSS greatly enhanced group productivity for idea generation tasks.





Jiang, X. and A.-H. Tan "CRCTOL: A semantic-based domain ontology learning system." Journal of the American Society for Information Science and Technology vol., n. (2009).  pp.: http://dx.doi.org/10.1002%2Fasi.21231



            Domain ontologies play an important role in supporting knowledge-based applications in the Semantic Web. To facilitate the building of ontologies, text mining techniques have been used to perform ontology learning from texts. However, traditional systems employ shallow natural language processing techniques and focus only on concept and taxonomic relation extraction. In this paper we present a system, known as Concept-Relation-Concept Tuple-based Ontology Learning (CRCTOL), for mining ontologies automatically from domain-specific documents. Specifically, CRCTOL adopts a full text parsing technique and employs a combination of statistical and lexico-syntactic methods, including a statistical algorithm that extracts key concepts from a document collection, a word sense disambiguation algorithm that disambiguates words in the key concepts, a rule-based algorithm that extracts relations between the key concepts, and a modified generalized association rule mining algorithm that prunes unimportant relations for ontology learning. As a result, the ontologies learned by CRCTOL are more concise and contain a richer semantics in terms of the range and number of semantic relations compared with alternative systems. We present two case studies where CRCTOL is used to build a terrorism domain ontology and a sport event domain ontology. At the component level, quantitative evaluation by comparing with Text-To-Onto and its successor Text2Onto has shown that CRCTOL is able to extract concepts and semantic relations with a significantly higher level of accuracy. At the ontology level, the quality of the learned ontologies is evaluated by either employing a set of quantitative and qualitative methods including analyzing the graph structural property, comparison to WordNet, and expert rating, or directly comparing with a human-edited benchmark ontology, demonstrating the high quality of the ontologies learned.





Jiann-Cherng, S. "The integration system for librarians' bibliomining." The Electronic Library vol. 28, n. 5 (2010).  pp. 709-721. http://dx.doi.org/10.1108/02640471011081988



            Purpose – For library services, bibliomining is concisely defined as the use of data mining techniques to extract patterns from the behavior-based artifacts recorded by library systems. The bibliomining process includes identifying topics, creating a data warehouse, refining data, exploring data and evaluating results. Practical implementations and applications in different areas have shown that a sufficiently complete and consolidated data warehouse is the critical prerequisite for successful data mining applications. However, building that warehouse from various data sources clearly hampers librarians in applying bibliomining to improve their services and operations. Moreover, most commercial data mining tools are too complex for librarians to adopt for bibliomining. The purpose of this paper is to propose a practical application model for librarians' bibliomining and then to develop a corresponding data-processing prototype system to support the successful application of data mining in libraries. Design/methodology/approach – The rapid prototyping software development method was applied to design a prototype bibliomining system. In order to evaluate the effectiveness of the system, a comparison experiment was carried out in which 15 librarians accomplished an assigned task. Findings – The results of the system usability scale (SUS) comparison and the turn-around time analysis established that the proposed model and the developed prototype system can really help librarians handle bibliomining applications better. Originality/value – The proposed bibliomining application model and its integration system are shown to be effective and efficient by the task-oriented experiment and the SUS administered to the 15 librarians. Comparing turn-around times for the assigned task, about 35 per cent of the time was saved. Librarians require an appropriate integration tool to assist them in successful bibliomining applications.





Kai, G., W. Yong-Cheng, et al. "Similar interest clustering and partial back-propagation-based recommendation in digital library." Library Hi Tech vol. 23, n. 4 (2005).  pp.: http://www.emeraldinsight.com/10.1108/07378830510636364



            The purpose of this paper is to propose a recommendation approach for information retrieval. Design/methodology/approach – Relevant results are presented on the basis of a novel data structure named FPT-tree, which is used to capture common interests. The data are then trained using a partial back-propagation neural network, with learning guided by users' click behaviors. Findings – Experimental results have shown the effectiveness of the approach. Originality/value – The approach attempts to integrate metrics of interest (e.g., click behavior, ranking) into the strategy of the recommendation system. Relevant results are first presented on the basis of the FPT-tree structure, and those results are then trained through a partial back-propagation neural network guided by users' click behaviors.





Kao, S. C., H. C. Chang, et al. "Decision support for the academic library acquisition budget allocation via circulation database mining." Information Processing & Management vol. 39, n. 1 (2003).  pp.: http://www.sciencedirect.com/science/journal/03064573



            Many approaches to decision support for the academic library acquisition budget allocation have been proposed to diversely reflect the management requirements. Different from these methods that focus mainly on either statistical analysis or goal programming, this paper introduces a model (ABAMDM, acquisition budget allocation model via data mining) that addresses the use of descriptive knowledge discovered in the historical circulation data explicitly to support allocating library acquisition budget. The major concern in this study is that the budget allocation should be able to reflect a requirement that the more a department makes use of its acquired materials in the present academic year, the more it can get budget for the coming year. The primary output of the ABAMDM used to derive weights of acquisition budget allocation contains two parts. One is the descriptive knowledge via utilization concentration and the other is the suitability via utilization connection for departments concerned. An application to the library of Kun Shan University of Technology was described to demonstrate the introduced ABAMDM in practice.
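
The core principle stated in the abstract is that next year's budget share should grow with this year's circulation of each department's acquisitions. The sketch below reduces that idea to a simple proportional rule on invented figures; it does not reproduce the ABAMDM utilization-concentration and utilization-connection weights.

# Minimal sketch of circulation-driven budget allocation: a simple proportional
# rule, not the actual ABAMDM weighting scheme. All figures are hypothetical.
circulation = {         # loans of each department's acquisitions this year
    "Engineering": 1200,
    "Business": 800,
    "Humanities": 500,
}
total_budget = 100_000  # next year's acquisition budget

total_loans = sum(circulation.values())
allocation = {dept: total_budget * loans / total_loans
              for dept, loans in circulation.items()}

for dept, amount in allocation.items():
    print(f"{dept}: {amount:,.0f}")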





Kostoff, R. N. "Literature-Related Discovery." Annual Review of Information Science and Technology (ARIST) vol. 43, n. (2009).  pp. 241-287. http://onlinelibrary.wiley.com/doi/10.1002/aris.144.v43:1/issuetoc



            Literature-related discovery (LRD) is linking two or more literature concepts that have heretofore not been linked (i.e., disjoint), in order to produce novel, interesting, plausible, and intelligible knowledge. LRD has two components: Literature-based discovery (LBD) generates potential discovery through literature analysis alone, whereas literature-assisted discovery (LAD) generates potential discovery through a combination of literature analysis and interactions among selected literature authors. In turn, there are two types of LBD and LAD: open discovery systems (ODS), where one starts with a problem and arrives at a solution, and closed discovery systems (CDS), where one starts with a problem and a solution, then determines the mechanism(s) that links them.
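
The open-discovery mode described above is often illustrated with the A-B-C pattern: start from a problem term A, collect intermediate terms B from A's literature, then propose target terms C that co-occur with B elsewhere but not yet with A. The sketch below walks through that pattern on a small made-up co-occurrence table (the terms merely echo Swanson's well-known fish-oil example; the links and ranking rule are invented).

# Toy walk-through of the A-B-C open-discovery pattern; all co-occurrence data
# below are invented.
problem_term_a = "Raynaud's disease"

# Intermediate B terms found in the A-literature.
b_terms = {"blood viscosity", "platelet aggregation", "vascular reactivity"}

# C terms co-occurring with each B term in other literatures.
c_terms_by_b = {
    "blood viscosity": {"fish oil", "pentoxifylline"},
    "platelet aggregation": {"fish oil", "aspirin"},
    "vascular reactivity": {"magnesium"},
}

already_linked_to_a = {"aspirin"}  # C terms already co-occurring with A

candidates = {}
for b in b_terms:
    for c in c_terms_by_b.get(b, set()) - already_linked_to_a:
        candidates.setdefault(c, set()).add(b)

# Rank candidate C terms by the number of distinct B terms bridging them to A.
for c, bridges in sorted(candidates.items(), key=lambda kv: -len(kv[1])):
    print(f"{c}: candidate link to {problem_term_a} via {sorted(bridges)}")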





Ku, L.-W., H.-W. Ho, et al. "Opinion mining and relationship discovery using CopeOpi opinion analysis system." Journal of the American Society for Information Science and Technology vol. 60, n. 7 (2009).  pp. 1486-1503. http://dx.doi.org/10.1002/asi.21067



            We present CopeOpi, an opinion-analysis system, which extracts from the Web opinions about specific targets, summarizes the polarity and strength of these opinions, and tracks opinion variations over time. Objects that yield similar opinion tendencies over a certain time period may be correlated due to the latent causal events. CopeOpi discovers relationships among objects based on their opinion-tracking plots and collocations. Event bursts are detected from the tracking plots, and the strength of opinion relationships is determined by the coverage of these plots. To evaluate opinion mining, we use the NTCIR corpus annotated with opinion information at sentence and document levels. CopeOpi achieves sentence- and document-level f-measures of 62% and 74%. For relationship discovery, we collected 1.3M economics-related documents from 93 Web sources over 22 months, and analyzed collocation-based, opinion-based, and hybrid models. We consider as correlated company pairs that demonstrate similar stock-price variations, and selected these as the gold standard for evaluation. Results show that opinion-based and collocation-based models complement each other, and that integrated models perform the best. The top 25, 50, and 100 pairs discovered achieve precision rates of 1, 0.92, and 0.79, respectively.
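
One concrete step the abstract describes is relationship discovery from opinion-tracking plots: objects whose opinion scores move together over time are flagged as potentially related. A minimal sketch of that step on invented time series follows; the paper's event-burst detection and coverage measures are not reproduced.

# Sketch of the relationship-discovery step: objects whose opinion-tracking
# plots move together are flagged as potentially related. Series are invented.
import numpy as np

opinion_series = {
    "CompanyA": np.array([0.2, 0.5, 0.7, 0.4, 0.1, -0.2]),
    "CompanyB": np.array([0.3, 0.6, 0.8, 0.5, 0.0, -0.3]),
    "CompanyC": np.array([-0.1, 0.0, -0.4, 0.2, 0.5, 0.6]),
}

names = list(opinion_series)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = np.corrcoef(opinion_series[names[i]], opinion_series[names[j]])[0, 1]
        print(f"{names[i]} ~ {names[j]}: opinion correlation {r:+.2f}")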





Lai, Y. and J. Zeng "A cross-language personalized recommendation model in digital libraries." The Electronic Library vol. 31, n. 3 (2013).  pp. 264-277. http://dx.doi.org/10.1108/EL-08-2011-0126



            Purpose – The purpose of this paper is to develop a cross-language personalized recommendation model based on web log mining, which can recommend academic articles, in different languages, to users according to their demands. Design/methodology/approach – The proposed model takes advantage of web log data archived in digital libraries and learns user profiles by means of integration analysis of a user's multiple online behaviors. Moreover, keyword translation was carried out to eliminate language dissimilarity between user and item profiles. Finally, article recommendation can be achieved using various existing algorithms. Findings – The proposed model can recommend articles in different languages to users according to their demands, and the integration analysis of multiple online behaviors can help to better understand a user's interests. Practical implications – This study has significant implications for digital libraries in non-English-speaking countries, since English is the most common language of current academic articles and it is very common for users in these countries to work with literature in more than one language. Furthermore, this approach is also useful for other text-based item recommendation systems. Originality/value – A lot of research work has been done in the personalized recommendation area, but few works have discussed the recommendation problem in multilingual circumstances. This paper deals with cross-language recommendation and, moreover, the proposed model puts forward an integration analysis method based on multiple online behaviors to understand users' interests, which can provide a reference for other recommendation systems in the digital age.
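
The keyword-translation step described above can be pictured with a very small sketch: map the keywords in a user profile through an assumed bilingual dictionary and score candidate articles by keyword overlap. The dictionary, profile, and articles below are invented, and a real system would use far richer translation and matching than this.

# Toy sketch of the keyword-translation step: translate profile keywords through
# an assumed bilingual dictionary, then score articles by keyword overlap.
bilingual = {"minería de datos": "data mining", "biblioteca digital": "digital library"}

user_profile = ["minería de datos", "biblioteca digital"]  # learned from web logs
article_keywords = {
    "article_1": {"data mining", "clustering"},
    "article_2": {"information literacy"},
}

translated_profile = {bilingual.get(k, k) for k in user_profile}
scores = {a: len(translated_profile & kws) for a, kws in article_keywords.items()}
print(max(scores, key=scores.get), scores)  # article_1 matches the profile best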





Lappas, G. "An overview of web mining in societal benefit areas." Online Information Review vol. 32, n. 2 (2008).  pp. 179-195. http://ejournals.ebsco.com/direct.asp?ArticleID=4946A99CCC163D629265



            Purpose - The focus of this paper is a survey of web-mining research related to areas of societal benefit. The article aims to focus particularly on web mining which may benefit societal areas by extracting new knowledge, providing support for decision making and empowering the effective management of societal issues. Design/methodology/approach - E-commerce and e-business are two fields that have been empowered by web mining, having many applications for increasing online sales and doing intelligent business. Have areas of social interest also been empowered by web mining applications? What are the current ongoing research and trends in e-services fields such as e-learning, e-government, e-politics and e-democracy? What other areas of social interest can benefit from web mining? This work will try to provide the answers by reviewing the literature for the applications and methods applied to the above fields. Findings - There is a growing interest in applications of web mining that are of social interest. This reveals that one of the current trends of web mining is toward the connection between intelligent web services and societal benefit applications, which denotes the need for interdisciplinary collaboration between researchers from various fields. Originality/value - On the one hand, this work presents to the web-mining community an overview of research opportunities in societal benefit areas. On the other hand, it presents to web researchers from various disciplines an approach for improving their web studies by considering web mining as a powerful research tool.





Liu, X., S. Yu, et al. "Weighted hybrid clustering by combining text mining and bibliometrics on a large-scale journal database." Journal of the American Society for Information Science and Technology vol. 61, n. 6 (2010).  pp. 1105 - 1119. http://doi.wiley.com/10.1002/asi.21312



            We propose a new hybrid clustering framework to incorporate text mining with bibliometrics in journal set analysis. The framework integrates two different approaches: clustering ensemble and kernel-fusion clustering. To improve the flexibility and the efficiency of processing large-scale data, we propose an information-based weighting scheme to leverage the effect of multiple data sources in hybrid clustering. Three different algorithms are extended by the proposed weighting scheme and they are employed on a large journal set retrieved from the Web of Science (WoS) database. The clustering performance of the proposed algorithms is systematically evaluated using multiple evaluation methods, and they were cross-compared with alternative methods. Experimental results demonstrate that the proposed weighted hybrid clustering strategy is superior to other methods in clustering performance and efficiency. The proposed approach also provides a more refined structural mapping of journal sets, which is useful for monitoring and detecting new trends in different scientific fields.
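
A minimal sketch of the kernel-fusion side of this framework, under assumptions: combine a text-based and a citation-based similarity matrix for the same journals with scalar weights, then cluster on the fused matrix. The matrices and the 0.6/0.4 weights are made up, and the paper's information-based weighting scheme and ensemble component are not reproduced.

# Minimal kernel-fusion sketch: combine text-based and citation-based journal
# similarity matrices with assumed weights, then cluster on the fused matrix.
import numpy as np
from sklearn.cluster import SpectralClustering

text_sim = np.array([
    [1.0, 0.8, 0.7, 0.1, 0.2],
    [0.8, 1.0, 0.6, 0.2, 0.1],
    [0.7, 0.6, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 1.0],
])
cite_sim = np.array([
    [1.0, 0.7, 0.6, 0.2, 0.1],
    [0.7, 1.0, 0.8, 0.1, 0.2],
    [0.6, 0.8, 1.0, 0.2, 0.1],
    [0.2, 0.1, 0.2, 1.0, 0.8],
    [0.1, 0.2, 0.1, 0.8, 1.0],
])

w_text, w_cite = 0.6, 0.4               # assumed weights for the two sources
fused = w_text * text_sim + w_cite * cite_sim

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(fused)
print(labels)  # e.g. [0 0 0 1 1]: two clusters of journals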





Liwen, V., Y. Rongbin, et al. "Web co-word analysis for business intelligence in the Chinese environment." Aslib Proceedings vol. 64, n. 6 (2012).  pp. 653-667. http://dx.doi.org/10.1108/00012531211281788



            Purpose – The study seeks to apply Web co-word analysis to the Chinese business environment to test the feasibility of the method there. Design/methodology/approach – The authors selected a group of companies in two Chinese industries, collected co-word data for the companies, analyzed the data with multidimensional scaling (MDS), and then compared the MDS maps generated from the co-word data with business situations to find out if the co-word method works. Findings – The study found that the Web co-word method could potentially be applied to the Chinese environment. The study also found the advantages and disadvantages of the Web co-word method vs the Web co-link method. Originality/value – Knowing the applicability of the Web co-word method to the Chinese environment contributes to the knowledge of this new Webometrics method. Mining business information from the Web is more valuable when applied to a foreign country where language and culture barriers exist. To use the co-word method, one does not have to be able to read or write in that language. One only needs to have the names of the companies to study, which can be easily obtained without knowledge of the language. The value of business information about countries such as China is obvious given the global nature of contemporary business competition and the significance of the Chinese economy to the world.
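
The method the abstract describes (co-word counts for company names, mapped with multidimensional scaling) can be sketched as follows; the company names, co-occurrence counts, and the count-to-dissimilarity conversion are all invented for illustration.

# Illustrative co-word analysis: pairwise page counts mentioning two company
# names together, converted to dissimilarities and projected with MDS.
import numpy as np
from sklearn.manifold import MDS

companies = ["CompanyA", "CompanyB", "CompanyC", "CompanyD"]
cooc = np.array([  # hypothetical co-occurrence (co-word) counts
    [0, 120, 15, 10],
    [120, 0, 20, 12],
    [15, 20, 0, 90],
    [10, 12, 90, 0],
], dtype=float)

dissim = 1.0 - cooc / cooc.max()   # higher co-occurrence = more similar
np.fill_diagonal(dissim, 0.0)

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)
for name, (x, y) in zip(companies, coords):
    print(f"{name}: ({x:+.2f}, {y:+.2f})")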





Löfström, T. and U. Johansson "Predicting the Benefit of Rule Extraction: A Novel Component in Data Mining." Human IT vol. 7, n. 3 (2002).  pp.: http://www.hb.se/bhs/ith/3-7/tluj.pdf








Lun-Wei, K. and C. Hsin-Hsi "Mining opinions from the Web: Beyond relevance retrieval." Journal of the American Society for Information Science and Technology vol. 58, n. 12 (2007).  pp.: http://www3.interscience.wiley.com/cgi-bin/jtoc/76501873/



            Documents discussing public affairs, common themes, interesting products, and so on, are reported and distributed on the Web. Positive and negative opinions embedded in documents are useful references and feedbacks for governments to improve their services, for companies to market their products, and for customers to purchase their objects. Web opinion mining aims to extract, summarize, and track various aspects of subjective information on the Web. Mining subjective information enables traditional information retrieval (IR) systems to retrieve more data from human viewpoints and provide information with finer granularity. Opinion extraction identifies opinion holders, extracts the relevant opinion sentences, and decides their polarities. Opinion summarization recognizes the major events embedded in documents and summarizes the supportive and the nonsupportive evidence. Opinion tracking captures subjective information from various genres and monitors the developments of opinions from spatial and temporal dimensions. To demonstrate and evaluate the proposed opinion mining algorithms, news and bloggers' articles are adopted. Documents in the evaluation corpora are tagged in different granularities from words, sentences to documents. In the experiments, positive and negative sentiment words and their weights are mined on the basis of Chinese word structures. The f-measure is 73.18% and 63.75% for verbs and nouns, respectively. Utilizing the sentiment words mined together with topical words, we achieve f-measure 62.16% at the sentence level and 74.37% at the document level.
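
As a toy illustration of the sentence- and document-level polarity scoring the abstract reports, the sketch below sums lexicon weights over sentences and aggregates them into a document score. The hand-made English lexicon stands in for the sentiment words and weights that the paper mines from Chinese word structures.

# Toy lexicon-based polarity scoring at sentence and document level. The
# hand-made English lexicon stands in for the mined sentiment words.
sentiment_weights = {"good": 1.0, "excellent": 1.5, "poor": -1.0, "terrible": -1.5}

document = [
    "the service was excellent and the staff were good",
    "the waiting time was terrible",
]

def sentence_polarity(sentence):
    # Sum the weights of sentiment words appearing in the sentence.
    return sum(sentiment_weights.get(word, 0.0) for word in sentence.split())

sentence_scores = [sentence_polarity(s) for s in document]
doc_score = sum(sentence_scores)

for sentence, score in zip(document, sentence_scores):
    print(f"{score:+.1f}  {sentence}")
print(f"document polarity: {'positive' if doc_score > 0 else 'negative'} ({doc_score:+.1f})")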





MacMillan, M. "Mining E-mail to Improve Information Literacy Instruction." Evidence Based Library & Information Practice vol. 5, n. 2 (2010).  pp. 103-106. http://ejournals.library.ualberta.ca/index.php/EBLIP/article/viewFile/7996/6968



            The article discusses a study describing how an academic librarian mined the volume and type of e-mail questions sent by students, and how this strategy led to improvements in information literacy instruction. Background on Mount Royal University (MRU) in Calgary, Alberta, is offered. Data collection and extraction from August 2008 to July 2009 are described. The implementation of changes to information literacy delivery during the 2009-2010 academic year, based on the research findings, is detailed.





Marrero, M., S. Sánchez-Cuadrado, et al. "Sistemas de recuperación de información adaptados al dominio biomédico." El Profesional de la Información vol. 19, n. 3 (2010).  pp. 246-254. http://elprofesionaldelainformacion.metapress.com/media/7pab6qgbap6tpkecbp6j/contributions/u/4/8/0/u480m8g27l202736.pdf



            The terminology used in biomedicine has lexical characteristics that have required the elaboration of terminological resources and information retrieval systems with specific functionalities. The main characteristics are the high rates of synonymy and homonymy, due to phenomena such as the proliferation of polysemic acronyms and their interaction with common language. Information retrieval systems in the biomedical domain use techniques oriented to the treatment of these lexical peculiarities. In this paper we review some of these techniques, such as the application of Natural Language Processing (BioNLP), the incorporation of lexical-semantic resources, and the application of Named Entity Recognition (BioNER). Finally, we present the evaluation methods adopted to assess the suitability of these techniques for retrieving biomedical resources.





Martínez Méndez, F. J. and R. López Carreño "Análisis prospectivo de las tendencias de desarrollo de los portales periodísticos españoles." Scire: Representación y organización del conocimiento vol. 11, n. 2 (2005).  pp. 33-62. http://ibersid.eu/ojs/index.php/scire/article/view/1520/1498



            The taxonomic study of the structure of journalistic portals and the analysis of the frequency of appearance of their components (informative products, documentary products and value-added services) have established the current state of the art regarding their development. Data mining is a technology that makes it possible to identify the trends and patterns followed in that development, starting from the knowledge base provided by previous work, and allows supplementary information to be retrieved beyond what has already been made explicit.


Source: Facultad de Traducción. U. de Salamanca.