What is it?
Data mining, or data exploration (the analysis stage of "Knowledge Discovery
in Databases", or KDD), is a field of computer science concerned with the
process of discovering patterns in large volumes of data. It draws on methods
from artificial intelligence, machine learning, statistics and database
systems. The overall goal of the data mining process is to extract
information from a data set and transform it into an understandable structure
for further use. Beyond the raw analysis step, it also involves database and
data management aspects, data pre-processing, model and inference
considerations, interestingness metrics, computational complexity
considerations, post-processing of the discovered structures, visualization,
and online updating.
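A minimal, hypothetical sketch of that analysis step, assuming Python with
scikit-learn (the data and values are invented, purely to illustrate
"discovering patterns" and summarizing them in a comprehensible structure):

# Minimal sketch of the analysis step of KDD: discovering groups (patterns)
# in a data set with a standard machine-learning method. Illustrative only;
# the data below are invented for the example.
import numpy as np
from sklearn.cluster import KMeans

# Toy data set: each row is a record with two numeric attributes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2],
              [8.0, 8.5], [8.3, 7.9], [7.8, 8.1]])

# "Pattern discovery": partition the records into 2 groups.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assignment of every record
print(model.cluster_centers_)  # a compact, comprehensible summary of the data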
The term is a buzzword, and it is frequently misused to refer to any form of
large-scale data or information processing (collection, extraction,
warehousing, analysis and statistics); it has also been generalized to any
kind of computer-based decision support system, including artificial
intelligence, machine learning and business intelligence. In the proper use
of the word, the key term is discovery, commonly defined as "detecting
something new". Even the popular book "Data Mining: Practical Machine
Learning Tools and Techniques with Java" (which covers mostly machine
learning material) was originally going to be called simply "Practical
Machine Learning", and the term "data mining" was added for marketing
reasons. Often the more general terms "(large-scale) data analysis" or
"analytics", or, when referring to the actual methods, artificial
intelligence and machine learning, are more appropriate.
Acosta Aguilera, M. E.
"Minería de datos y descubrimiento del
conocimiento." Info: Congreso Internacional de
Información vol. 5, n. (2004). pp.:
http://www.congreso-info.cu/UserFiles/File/Info/Info2004/Ponencias/056.pdf
Data mining is defined as a set of procedures, techniques and algorithms for
extracting the relationships, patterns and hidden information contained in
databases. The relationship between data mining and knowledge discovery is
established and the steps of this discovery process are described. The tasks
covered by data mining, the basic components of its models, and the most
widely used techniques and methods are discussed. Some of the problems and
challenges that data mining must still face before it becomes fully
widespread are analyzed, and some of its applications are described. The
conclusion is that, regardless of the complexity of the tool used, the use of
data mining techniques benefits any organization with large databases.
Alejandra, S., V.-C. Christian, et al.
"Using data mining
techniques for exploring learning object repositories."
The Electronic Library vol. 29, n. 2 (2011). pp.
162-180.
http://dx.doi.org/10.1108/02640471111125140
Purpose – This paper aims to show the results obtained from the data
mining techniques application to learning objects (LO) metadata.
Design/methodology/approach – A general review of the literature was
carried out. The authors gathered and pre-processed the data, and then
analyzed the results of data mining techniques applied upon the LO
metadata. Findings – It is possible to extract new knowledge based on
learning objects stored in repositories. For example, it is possible to
identify distinctive features and group learning objects according to
them. Semantic relationships can also be found among the attributes that
describe learning objects. Research limitations/implications – In the
first section, four test repositories are included for case study. In the
second section, the analysis is focused on the most complete repository
from the pedagogical point of view. Originality/value – Many publications
report results of analysis on repositories mainly focused on the number,
evolution and growth of the learning objects. But, there is a shortage of
research using data mining techniques oriented to extract new semantic
knowledge based on learning objects metadata.
Ana, K., D. Vladan, et al.
"Using data mining to improve digital
library services." The Electronic Library vol. 28,
n. 6 (2010). pp. 829-843.
http://dx.doi.org/10.1108/02640471011093525
Purpose – This paper aims to propose a solution for recommending digital
library services based on data mining techniques (clustering and
predictive classification). Design/methodology/approach – Data mining
techniques are used to recommend digital library services based on the
user's profile and search history. First, similar users were clustered
together, based on their profiles and search behavior. Then predictive
classification for recommending appropriate services to them was used. It
has been shown that users in the same cluster have a high probability of
accepting similar services or their patterns. Findings – The results
indicate that k-means clustering and Naive Bayes classification may be
used to improve the accuracy of service recommendation. The overall
accuracy is satisfying, while average accuracy depends on the specific
service. The results were better for frequently occurring services.
Research limitations/implications – Datasets were used from the KOBSON
digital library. Only clustering and predictive classification were
applied. If the correlation between the service and the institution were
higher, it would have better accuracy. Originality/value – The paper
applied different and efficient data mining techniques for clustering
digital library users based on their profiles and their search behavior,
i.e. users' interaction with library services, and obtained user patterns
with respect to the library services they use. A digital library may
apply this approach to offer appropriate services to new users more
easily. The recommendations will be based on library items that similar
users have already found useful.
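To make the two-stage idea concrete, here is a hedged sketch in Python with
scikit-learn; it is not the authors' KOBSON implementation, and the user
features, services and data are invented:

# Stage 1: cluster similar users; Stage 2: predict the service a new user
# is likely to accept. All values below are toy stand-ins.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

# Rows = users, columns = profile / search-behaviour features (invented).
profiles = np.array([[5, 0, 1], [4, 1, 0], [0, 6, 2],
                     [1, 5, 3], [0, 1, 7], [1, 0, 6]])
# Service each user ended up accepting (invented labels).
services = np.array(["A", "A", "B", "B", "C", "C"])

# Stage 1: group similar users.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(profiles)

# Stage 2: learn to predict the accepted service; the cluster id is added as
# an extra feature, mirroring "users in the same cluster accept similar services".
X = np.column_stack([profiles, km.labels_])
clf = GaussianNB().fit(X, services)

new_user = np.array([[0, 5, 2]])
new_cluster = km.predict(new_user)
# Likely recommends the service accepted by users similar to the new one.
print(clf.predict(np.column_stack([new_user, new_cluster])))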
Ananiadou, S.
"The National Centre for Text Mining: a vision for
the future." Ariadne vol., n. 53 (2007). pp.
np.
http://www.ariadne.ac.uk/issue53/ananiadou/
Describes the National Centre for Text Mining (NaCTeM) and the main
scientific challenges it helps to solve together with issues related to
deployment, use and uptake of NaCTeM's text mining tools and services.
NaCTeM has developed a variety of text mining tools and services that
offer numerous benefits to a wide range of users. These range from
considerable reductions in time and effort for finding and linking
pertinent information from large scale textual resources, to customised
solutions in semantic data analysis and knowledge management. Enhancing
metadata is one of the important benefits of deploying text mining
services. TerMine (TM), a service for automatic term recognition, is
being used for subject classification, creation of taxonomies, controlled
vocabularies, ontology building and Semantic Web activities. As NaCTeM
enters into its second phase, the goal is to improve levels of
collaboration with Semantic Grid and Digital Library initiatives and
contributions to bridging the gap between the library world and the
e-Science world through an improved facility for constructing metadata
descriptions from textual descriptions via TM. Adapted from the source
document.
Ananiadou, S., J. Chruszcz, et al.
"The National Centre for Text
Mining: Aims and Objectives." Ariadne vol., n. 42
(2005). pp.:
http://www.ariadne.ac.uk/issue42/ananiadou/
In
this article we describe the role of the National Centre for Text Mining
(NaCTeM). NaCTeM is operated by a consortium of three Universities: the
University of Manchester which leads the consortium, the University of
Liverpool and the University of Salford. The service activity is run by
the National Centre for Dataset Services (MIMAS), based within Manchester
Computing (MC). As part of previous and ongoing collaboration, NaCTeM
involves, as self-funded partners, world-leading groups at San Diego
Supercomputer Center (SDSC), the University of California at Berkeley
(UCB), the University of Geneva and the University of Tokyo. NaCTeM’s
initial focus is on bioscience and biomedical texts as there is an
increasing need for bio-text mining and automated methods to search,
access, extract, integrate and manage textual information from
large-scale bio-resources. NaCTeM was established in Summer 2004 with
funding from the Joint Information Systems Committee (JISC), the
Biotechnology and Biological Sciences Research Council (BBSRC) and the
Engineering and Physical Sciences Research Council (EPSRC), with the
consortium itself investing almost the same amount as it received in
funding.
Arakawa, Y., A. Kameda, et al.
"Adding Twitter-specific features
to stylistic features for classifying tweets by user type and number of
retweets." Journal of the Association for Information
Science and Technology vol. 65, n. 7 (2014). pp. 1416-1423.
http://dx.doi.org/10.1002/asi.23126
Recently, Twitter has received much attention, both from the general
public and researchers, as a new method of transmitting information.
Among others, the number of retweets (RTs) and user types are the two
important items of analysis for understanding the transmission of
information on Twitter. To analyze this point, we applied text
classification and feature extraction experiments using random forests
machine learning with conventional stylistic and Twitter-specific
features. We first collected tweets from 40 accounts with a high number
of followers and created tweet texts from 28,756 tweets. We then
conducted 15 types of classification experiments using a variety of
combinations of features such as function words, speech terms, Twitter's
descriptive grammar, and information roles. We deliberately observed the
effects of features for classification performance. The results indicated
that classification per user achieved the best performance.
Furthermore, we observed that certain features had a greater impact on
classification. In the case of the experiments that assessed the level of
RT quantity, information roles had an impact. In the case of user
experiments, important features, such as the honorific postpositional
particle and auxiliary verbs, such as “desu” and “masu,” had an impact.
This research clarifies the features that are useful for categorizing
tweets according to the number of RTs and user types.
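A hedged illustration of the random-forest approach, assuming scikit-learn;
the stylistic and Twitter-specific features below are invented
simplifications of those used in the study:

# Classify tweets by user type with a random forest over simple features.
from sklearn.ensemble import RandomForestClassifier

def features(tweet):
    return [
        len(tweet),            # stylistic: length
        tweet.count("!"),      # stylistic: exclamation marks
        tweet.count("#"),      # Twitter-specific: hashtags
        tweet.count("@"),      # Twitter-specific: mentions
        int("http" in tweet),  # Twitter-specific: contains a URL
    ]

tweets = ["Big news! Read more at http://example.com #launch",
          "@friend see you tonight!!",
          "Our quarterly report is available at http://example.com",
          "lol that was great @friend"]
user_type = ["organization", "individual", "organization", "individual"]

X = [features(t) for t in tweets]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, user_type)
print(clf.predict([features("New product announced http://example.com #release")]))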
Baeza Yates, R.
"Tendencias en minería de datos de la
Web." El Profesional de la Información vol. 18, n.
1 (2009). pp. 5-10.
http://elprofesionaldelainformacion.metapress.com/media/9fppwlwxwpp61xvrlh87/contributions/3/7/5/7/3757882252861334.pdf
Overview and trends of different aspects and applications of data mining on
the Internet, in relation to Web 2.0, spam, analysis of searches, social
networks and privacy.
Baeza-Yates, R.
"Excavando la web." El Profesional
de la Información vol. 13, n. 1 (2004). pp.:
http://elprofesionaldelainformacion.metapress.com/(j50ltv55nlwsbvu11bndbgmb)/app/home/journal.asp?referrer=parent&backto=homemainpublications,1,1
The web is the most important phenomenon on the internet, as shown by its
exponential growth and diversity. Because of the volume and richness of its
data, search engines have become one of its main tools. They are useful when
we know what to look for; yet the web surely holds many answers to questions
never imagined. The process of discovering interesting relationships or
patterns in a data set is called data mining, and in the case of the web it
is called web mining. In this article we present the most important ideas in
web mining and some of its applications.
Baeza-Yates, R., C. Hurtado, et al.
"Improving search engines by
query clustering." Journal of the American Society for
Information Science and Technology vol. 58, n. 12 (2007).
pp.:
http://www3.interscience.wiley.com/cgi-bin/jtoc/76501873/
In
this paper, we present a framework for clustering Web search engine
queries whose aim is to identify groups of queries used to search for
similar information on the Web. The framework is based on a novel term
vector model of queries that integrates user selections and the content
of selected documents extracted from the logs of a search engine. The
query representation obtained allows us to treat query clustering
similarly to standard document clustering. We study the application of
the clustering framework to two problems: relevance ranking boosting and
query recommendation. Finally, we evaluate with experiments the
effectiveness of our approach.
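A rough sketch of the general idea (not the paper's exact term-vector model),
assuming scikit-learn: each query is represented by the text of documents
clicked for it, and the resulting vectors are clustered like documents. The
queries and click data are invented:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# query -> concatenated snippets of documents users selected for that query
clicked_text = {
    "cheap flights":   "airline tickets low cost flight deals",
    "flight offers":   "discount airfare cheap airline tickets",
    "python tutorial": "learn python programming beginner guide",
    "learn python":    "python programming course exercises",
}
queries = list(clicked_text)
X = TfidfVectorizer().fit_transform(clicked_text[q] for q in queries)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for q, c in zip(queries, labels):
    print(c, q)   # queries about the same information need share a cluster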
Bekhuis, T.
"Conceptual biology, hypothesis discovery, and text
mining: Swanson's legacy." Biomedical Digital
Libraries vol. 3, n. 2 (2006). pp.:
http://www.bio-diglib.com/content/pdf/1742-5581-3-2.pdf
Innovative biomedical librarians and information specialists who want to
expand their roles as expert searchers need to know about profound
changes in biology and parallel trends in text mining. In recent years,
conceptual biology has emerged as a complement to empirical biology. This
is partly in response to the availability of massive digital resources
such as the network of databases for molecular biologists at the National
Center for Biotechnology Information. Developments in text mining and
hypothesis discovery systems based on the early work of Swanson, a
mathematician and information scientist, are coincident with the
emergence of conceptual biology. Very little has been written to
introduce biomedical digital librarians to these new trends. In this
paper, background for data and text mining, as well as for knowledge
discovery in databases (KDD) and in text (KDT) is presented, then a brief
review of Swanson's ideas, followed by a discussion of recent approaches
to hypothesis discovery and testing. 'Testing' in the context of text
mining involves partially automated methods for finding evidence in the
literature to support hypothetical relationships. Concluding remarks
follow regarding (a) the limits of current strategies for evaluation of
hypothesis discovery systems and (b) the role of literature-based
discovery in concert with empirical research. Report of an
informatics-driven literature review for biomarkers of systemic lupus
erythematosus is mentioned. Swanson's vision of the hidden value in the
literature of science and, by extension, in biomedical digital databases,
is still remarkably generative for information scientists, biologists,
and physicians.
Bengtson, J.
"Why I Can't Love the Homemade Semantic
Web." B Sides vol., n. (2010). pp.:
http://ir.uiowa.edu/bsides/20
Almost
all information professionals agree that the web needs to move to a
semantic structure. While work is proceeding in this area, movements to
get individual web authors to use semantic markup tools have also been on
the rise. This author argues that such efforts are ill conceived and he
proposes an automated alternative.
Benoît, G.
"Data Mining." Annual Review of
Information Science and Technology (ARIST) vol. 36, n.
(2002). pp.:
http://www3.interscience.wiley.com/cgi-bin/jissue/109883774
Data
mining (DM) is a multistaged process of extracting previously
unanticipated knowledge from large databases, and applying the results to
decision making. Data mining tools detect patterns from the data and
infer associations and rules from them. The extracted information may
then be applied to prediction or classification models by identifying
relations within the data records or between databases. Those patterns
and rules can then guide decision making and forecast the effects of
those decisions. However, this definition may be applied equally to
'knowledge discovery in databases' (KDD). Indeed, in the recent
literature of DM and KDD, a source of confusion has emerged, making it
difficult to determine the exact parameters of both. KDD is sometimes
viewed as the broader discipline, of which data mining is merely a
component-specifically pattern extraction, evaluation, and cleansing
methods (Raghavan, Deogun, & Sever, 1998, p. 397). Thurasingham
(1999, p. 2) remarked that 'knowledge discovery,' 'pattern discovery,'
'data dredging,' 'information extraction,' and 'knowledge mining' are all
employed as synonyms for DM. Trybula, in his ARIST chapter on text mining,
observed that the 'existing work [in KDD] is confusing because the
terminology is inconsistent and poorly defined.'
Blake, C.
"Text Mining." Annual Review of
Information Science and Technology (ARIST) vol. 45, n.
(2011). pp. 123-156.
ARIST,
published annually since 1966, is a landmark publication within the
information science community. It surveys the landscape of information
science and technology, providing an analytical, authoritative, and
accessible overview of recent trends and significant developments. The
range of topics varies considerably, reflecting the dynamism of the
discipline and the diversity of theoretical and applied perspectives.
While ARIST continues to cover key topics associated with
"classical" information science (e.g., bibliometrics,
information retrieval), the editor has selectively expanded its footprint
in an effort to connect information science more tightly with cognate
academic and professional communities.
Candás Romero, J.
"Minería de datos en bibliotecas:
bibliominería." BiD: textos universitaris de
biblioteconomia i documentació vol., n. 17 (2006). pp.:
http://www2.ub.edu/bid/consulta_articulos.php?fichero=17canda2.htm
A theoretical introduction to the application of data mining in libraries,
known as bibliomining (the term "bibliominería" is proposed as its Spanish
equivalent), is presented. Some of its possible practical applications are
also described, showing how they support the so-called Library 2.0 and the
creation and management of more and better user-oriented services based on
new technologies. Finally, the problem of privacy in the application of
bibliomining is analyzed.
Capuano, E. A.
"O poder cognitivo das redes neurais artificiais
modelo ART1 na recuperação da informação." Ciência da
informação vol. 38, n. 1 (2009). pp. 9-30.
http://www.scielo.br/pdf/ci/v38n1/01.pdf
This
article reports an experiment with a computational simulation of an
Information Retrieval System constituted of a textual indexing base from
a sample of documents, an artificial neural network software implementing
Adaptive Resonance Theory concepts for the process of ordering and
presenting outputs, and a human user interacting with the system in query
processing. The goal of the experiment was to demonstrate (i) the
usefulness of Carpenter and Grossberg (1988) neural networks based on
that theory, and (ii) the power of semantic resolution based on
syntagmatic indexing of the SiRILiCO approach proposed by
Gottschalg-Duque (2005), for whom a noun phrase or proposition is a
linguistic unity constituted of meaning larger than a word meaning and
smaller than a story telling or a theory meaning. The experiment
demonstrated the effectiveness and efficiency of an Information Retrieval
System joining together those resources, and the conclusion is that such
computational environment will be capable of dynamic and on-line
clustering with continuing inputs and learning in a non-supervised
fashion, without batch training needs (off-line), to answer user queries
in computer networks with promising performance. Adapted from the source
document.
Chan-Chine, C. and C. Ruey-Shun
"Using data mining technology to
solve classification problems: A case study of campus digital
library." The Electronic Library vol. 24, n. 3
(2006). pp.:
http://www.emeraldinsight.com/Insight/ViewContentServlet?Filename=Published/EmeraldFullTextArticle/Articles/2630240303.html
Traditional library catalogs have become inefficient and inconvenient in
assisting library users. Readers may spend a lot of time searching
library materials via printed catalogs. Readers need an intelligent and
innovative solution to overcome this problem. The paper seeks to examine
data mining technology which is a good approach to fulfill readers'
requirements. Design/methodology/approach – Data mining is considered to
be the non-trivial extraction of implicit, previously unknown, and
potentially useful information from data. This paper analyzes readers'
borrowing records using the techniques of data analysis, building a data
warehouse, and data mining. Findings – The paper finds that after mining
data, readers can be classified into different groups according to the
publications in which they are interested. Some people on the campus also
have a greater preference for multimedia data. Originality/value – The
data mining results show that all readers can be categorized into five
clusters, and each cluster has its own characteristics. The frequency
with which graduates and associate researchers borrow multimedia data is
much higher. This phenomenon shows that these readers have a higher
preference for accepting digitized publications. Also, the number of
readers borrowing multimedia data has increased over the years. This
trend indicates that readers' preferences are gradually shifting towards
reading digital publications.
Chaves Ramos, H. d. S. and M. Brascher
"Aplicação da descoberta
de conhecimento em textos para apoio à construção de indicadores
infométricos para a área de C&T." Ciência da
informação vol. 38, n. 2 (2009). pp. 56-68.
http://www.scielo.br/pdf/ci/v38n2/05.pdf
This
article describes the results of a research applying Knowledge Discovery
in Texts (KDT) in textual contents, which are important sources of
information for decision-making purposes. The main objective of the
research is to verify the effectiveness of KDT for discovering
information that may support the construction of ST&I indicators and
for the definition of public policies. The case study of the research was
the textual content of the Brazilian Service for Technical Answers
(Servico Brasileiro de Respostas Tecnicas -- SBRT) and the technique
adopted was document clustering from terms mined in the database. The use
of DCT for extracting hidden information -- that could not be found by
using the traditional information retrieval -- from textual documents
proved to be efficient. The presence of environmental concerns in the
demand posted by SBRT's users and the applicability of DCT to orient
internal policies for SBRT network were also evidenced by the research
results. Adapted from the source document.
Chen, C.-L., F. S. C. Tseng, et al.
"Mining fuzzy frequent
itemsets for hierarchical document clustering."
Information Processing & Management vol. 46, n. 2
(2010). pp. 193-211.
http://www.sciencedirect.com/science/article/B6VC8-4XK9J7J-1/2/733c0f885a05224f80d3f0ac97148e41
As
text documents are explosively increasing on the Internet, the process of
hierarchical document clustering has been proven to be useful for
grouping similar documents for versatile applications. However, most
document clustering methods still suffer from challenges in dealing with
the problems of high dimensionality, scalability, accuracy, and
meaningful cluster labels. In this paper, we will present an effective
Fuzzy Frequent Itemset-Based Hierarchical Clustering (F2IHC) approach,
which uses fuzzy association rule mining algorithm to improve the
clustering accuracy of Frequent Itemset-Based Hierarchical Clustering
(FIHC) method. In our approach, the key terms will be extracted from the
document set, and each document is pre-processed into the designated
representation for the following mining process. Then, a fuzzy
association rule mining algorithm for text is employed to discover a set
of highly-related fuzzy frequent itemsets, which contain key terms to be
regarded as the labels of the candidate clusters. Finally, these
documents will be clustered into a hierarchical cluster tree by referring
to these candidate clusters. We have conducted experiments to evaluate
the performance based on Classic4, Hitech, Re0, Reuters, and Wap
datasets. The experimental results show that our approach not only
absolutely retains the merits of FIHC, but also improves the accuracy
quality of FIHC.
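A much-simplified sketch of the frequent-itemset step, in plain Python; it
ignores the fuzzy membership degrees used in the paper, and the documents and
support threshold are invented:

# Find sets of key terms that co-occur in many documents and can serve as
# candidate cluster labels (brute-force enumeration, not full Apriori).
from itertools import combinations

docs = [
    {"fuzzy", "clustering", "text"},
    {"fuzzy", "clustering", "labels"},
    {"clustering", "text", "hierarchy"},
    {"fuzzy", "text", "clustering"},
]
min_support = 3  # an itemset must appear in at least 3 documents

def frequent_itemsets(docs, min_support, max_size=3):
    items = sorted(set().union(*docs))
    result = {}
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            support = sum(1 for d in docs if set(itemset) <= d)
            if support >= min_support:
                result[itemset] = support
    return result

for itemset, support in frequent_itemsets(docs, min_support).items():
    print(itemset, support)   # e.g. ('clustering', 'fuzzy') with support 3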
Chen, Y.-L., Y.-H. Liu, et al.
"A text mining approach to assist
the general public in the retrieval of legal documents."
Journal of the American Society for Information Science and
Technology vol. 64, n. 2 (2013). pp. 280-290.
http://dx.doi.org/10.1002/asi.22767
Applying text mining techniques to legal issues has been an emerging
research topic in recent years. Although some previous studies focused on
assisting professionals in the retrieval of related legal documents, they
did not take into account the general public and their difficulty in
describing legal problems in professional legal terms. Because this
problem has not been addressed by previous research, this study aims to
design a text-mining-based method that allows the general public to use
everyday vocabulary to search for and retrieve criminal judgments. The
experimental results indicate that our method can help the general
public, who are not familiar with professional legal terms, to acquire
relevant criminal judgments more accurately and effectively.
Clare, T.
"Advances in Information Retrieval."
Journal of Documentation vol. 68, n. 5 (2012). pp.:
http://www.emeraldinsight.com/journals.htm?articleid=17051071
This
publication provides an excellent “state of the art” review and
description of recent developments and improvements within information
retrieval (IR). It is very broad ranging in its coverage and the
contributions are organised under the following headings: natural
language processing (NLP) and text mining; web IR; evaluation; multi
media IR; distributed IR and performance issues; IR theory and formal
models; personalisation and recommendation; domain specific IR and cross
language IR; user issues. In addition to 44 revised full papers, plus the
keynote address, it includes abstracts of invited talks on emerging
issues including collaborative web searching, the impact of visualisation
technology on NLP and developments in automatic image annotation for
multimedia IR. A fuller description of these talks or a clear reference
to relevant papers would have been useful. Finally it includes posters
and a description of some demonstrations of new IR systems.
Cobo, A., R. Rocha, et al.
"Gestão da informação em ambientes
globais: computação bio-inspirada em repositórios de documentos
econômicos multilingues." Informação & Sociedade:
Estudos vol. 23, n. 1 (2013). pp.:
http://periodicos.ufpb.br/ojs/index.php/ies/article/view/15128
The
information is a strategic resource of first order for organizations, so
it is essential to have methodologies and tools that allow them to
properly manage information and extract knowledge from it. Organizations
also need knowledge generation strategies using unstructured textual
information from different sources and in different languages. This paper
presents two bio-inspired approaches to clustering multilingual document
collections in a particular field (economics and business). This problem
is quite significant and necessary to organize the huge volume of
information managed within organisations in a global context
characterised by the intensive use of Information and Communication
Technologies. The proposed clustering algorithms take inspiration from
the behaviour of real ant colonies and can be applied to identify groups
of related multilingual documents in the field of economics and business.
In order to obtain a language independent vector representation, several
linguistic resources and tools are used. The performance of the
algorithms is analysed using a corpus of 250 documents in Spanish and
English from different functional areas of the enterprise, and
experimental results are presented. The results demonstrate the
usefulness and effectiveness of the algorithms as a clustering
technique.
Cobo Ortega, A., R. Rocha Blanco, et al.
"Descubrimiento de
conocimiento en repositorios documentales mediante técnicas de Minería de
Texto y Swarm Intelligence." Rect@ : Revista Electrónica
de Comunicaciones y Trabajos de ASEPUMA vol., n. 10 (2009).
pp. 105-124.
http://dialnet.unirioja.es/servlet/extart?codigo=3267050
The combined use of text mining methodologies and Artificial Intelligence
techniques supports document management processes and optimizes the
mechanisms for categorization, automatic knowledge extraction and clustering
of document collections. The paper proposes an integral document management
model for processing unstructured information. Specialized glossaries and
thesauri are used to establish semantic relationships between terms, and
Swarm Intelligence techniques are used for knowledge extraction. The model
has been implemented in an intuitive, multilingual application that
integrates text mining techniques.
Cui, H.
"Competency evaluation of plant character ontologies
against domain literature." Journal of the American
Society for Information Science and Technology vol. 61, n. 6
(2010). pp. n/a.
http://doi.wiley.com/10.1002/asi.21325
Specimen identification keys are still the most commonly created tools
used by systematic biologists to access biodiversity information.
Creating identification keys requires analyzing and synthesizing large
amounts of information from specimens and their descriptions and is a
very labor-intensive and time-consuming activity. Automating the
generation of identification keys from text descriptions becomes a highly
attractive text mining application in the biodiversity domain.
Fine-grained semantic annotation of morphological descriptions of
organisms is a necessary first step in generating keys from text.
Machine-readable ontologies are needed in this process because most
biological characters are only implied (i.e., not stated) in
descriptions. The immediate question to ask is: How well do existing
ontologies support semantic annotation and automated key generation? With
the intention to either select an existing ontology or develop a unified
ontology based on existing ones, this paper evaluates the coverage,
semantic consistency, and inter-ontology agreement of a biodiversity
character ontology and three plant glossaries that may be turned into
ontologies. The coverage and semantic consistency of the
ontology/glossaries are checked against the authoritative domain
literature, namely, Flora of North America and Flora of China. The
evaluation results suggest that more work is needed to improve the
coverage and interoperability of the ontology/glossaries. More concepts
need to be added to the ontology/glossaries and careful work is needed to
improve the semantic consistency. The method used in this paper to
evaluate the ontology/glossaries can be used to propose new candidate
concepts from the domain literature and suggest appropriate
definitions.
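A toy sketch of the kind of coverage check described above, in Python; the
glossary terms and "literature" terms are invented stand-ins, not data from
the study:

# What fraction of character terms found in the domain literature are present
# in a glossary/ontology, and which terms are candidate new concepts?
glossary = {"leaf", "petiole", "ovate", "serrate", "glabrous"}
literature_terms = {"leaf", "ovate", "pubescent", "serrate", "lanceolate", "petiole"}

covered = literature_terms & glossary
missing = literature_terms - glossary
print(f"coverage: {len(covered) / len(literature_terms):.0%}")  # 67%
print("candidate new concepts:", sorted(missing))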
de la Puente, M.
"Gestión del conocimiento y minería de
datos." E-LIS: E-Prints in Library and Information
Science vol., n. (2010). pp.:
http://www.ccinfo.com.ar/documentos_trabajo/DT_019.pdf
Knowledge Management refers to the set of processes developed within an
organization to create, organize, store and transfer knowledge. Data Mining
is the discipline whose goal is the extraction of the knowledge implicit in
large databases. Data Mining plays a fundamental role in turning implicit
knowledge into explicit knowledge, and in the different stages of the
Knowledge Management process in organizations.
Del-Fresno-García, M.
"Infosociabilidad: monitorización e
investigación en la web 2.0 para la toma de decisiones."
El Profesional de la Información vol. 20, n. 5
(2011). pp. 548 - 554.
http://eprints.rclis.org/bitstream/10760/16150/1/Miguel-Del-Fresno-Infosociabilidad-reputacion-Online.pdf
This
methodology offers an approach to studying the information available
within Web 2.0 Media and User-Generated Content (MUGC). The large-scale
generation of online information is the result of collective social
action based on information: Infosociability. Competitive Intelligence
(CI) aims to monitor and research a company’s web 2.0 environment for
information relevant to its decision-making process. Facing the
possibilities and limitations that today’s technology offers for
processing the communication of meanings and abstract ideas in text
format, a methodology derived from empirical research on web 2.0 is
proposed. Monitoring and research are identified as the two key processes
that generate insights aimed to facilitate decision-making. The relevance
of each stage is illustrated with reference to the diverse methodological
challenges encountered while extracting and analyzing large amounts of
online information.
Eíto Brun, R. and J. A. Senso
"Minería textual."
El Profesional de la Información vol. 13, n. 1
(2004). pp.:
http://www.elprofesionaldelainformacion.com/contenidos/2004/enero/2.pdf
This
article attempts to establish a definition for 'text mining' and, at the
same time, to identify its relationship with other fields: text
retrieval, data mining and computational linguistics. In addition, there
is an analysis of the impact of text mining, a reference to existing
commercial applications on the market and, lastly, a brief description of
the techniques used for developing and implementing text mining
systems.
Eric Lease, M.
"Use and understand: the inclusion of services
against texts in library catalogs and “discovery systems”."
Library Hi Tech vol. 30, n. 1 (2012). pp. 35-59.
http://dx.doi.org/10.1108/07378831211213201
Purpose – The purpose of this article is to outline possibilities for the
integration of text mining and other digital humanities computing
techniques into library catalogs and “discovery systems”.
Design/methodology/approach – The approach has been to survey existing
text mining apparatus and apply this to traditional library systems.
Findings – Through this process it is found that there are many ways
library interfaces can be augmented to go beyond the processes of find
and get and evolve to include processes of use and understand.
Originality/value – To the best of the author's knowledge, this type of
augmentation has yet to be suggested or implemented across
libraries.
Escorsa, P. and R. Maspons
"Los mapas tecnológicos."
De la vigilancia tecnológica a la inteligencia competitiva
vol., n. 5 (2001). pp.:
http://148.216.10.83/VIGILANCIA/capitulo_5.htm
Under the generic heading of Technology Maps, this chapter brings together
several topics that are very important for business intelligence, such as the
construction of the maps themselves and the relationship of products and/or
technologies with markets, for which a matrix is proposed to help uncover
opportunities, although it must be noted that this method is still at a very
early stage. Finally, data mining, increasingly used in companies, is briefly
introduced, even though it relies on techniques such as neural networks and
decision trees that are not described in the previous chapter.
Febles Rodríguez, J. P. and A. González Pérez
"Aplicación de la
minería de datos en la bioinformática." Acimed: revista
cubana de los profesionales de la información y la comunicación en
salud vol. 10, n. 2 (2002). pp.:
http://bvs.sld.cu/revistas/aci/vol10_2_02/aci03202.htm
In the coming years the biomedical sciences will advance spectacularly as a
result of the Human Genome Project. New technologies based on molecular
genetics and informatics are key to this development, since they provide
powerful instruments for obtaining and analyzing genetic information. The
emergence of these technologies has made the development of genomics
possible by facilitating the study of gene interactions and their influence
on the development of diseases, all of which affects clinical diagnosis, drug
discovery, epidemiology and medical informatics. In recent years, data mining
has boomed as a support for information and knowledge management approaches,
as well as for discovering the meaning of the data stored in large data
banks.
Firestone, J. M.
"Mining for information gold."
Information Management Journal vol. 39, n. 5 (2005).
pp. 47-50, 52.
Discusses the concept of data mining and its value for records and
information management (RIM) professionals in enhancing the quality of
information. Shows how to get started in data mining and considers some
of the concerns about the technique, including assuring the quality of
the data mined in terms of currency, completeness and accuracy. Attempts
to predict the future for the technique and suggests that this might lie
in the direction of innovation currently being undertaken at university
laboratories, the increasing popularity of open analytical platforms from
vendors, the integration of business intelligence and data mining
technologies, and the continuing development of intelligent agents and
distributed knowledge processing for processing the mass of information
becoming available. (Quotes from original text)
Fox, L. M., L. A. Williams, et al.
"Negotiating a Text Mining
License for Faculty Researchers." Information Technology
and Libraries vol. 33, n. 3 (2014). pp. 5-21.
http://ejournals.bc.edu/ojs/index.php/ital/article/view/5485
This
case study examines strategies used to leverage the library’s existing
journal licenses to obtain a large collection of full-text journal
articles in extensible markup language (XML) format; the right to text
mine the collection; and the right to use the collection and the data
mined from it for grant-funded research to develop biomedical natural
language processing (BNLP) tools. Researchers attempted to obtain content
directly from PubMed Central (PMC). This attempt failed due to limits on
use of content in PMC. Next researchers and their library liaison
attempted to obtain content from contacts in the technical divisions of
the publishing industry. This resulted in an incomplete research data
set. Then researchers, the library liaison, and the acquisitions
librarian collaborated with the sales and technical staff of a major
science, technology, engineering, and medical (STEM) publisher to
successfully create a method for obtaining XML content as an extension of
the library’s typical acquisition process for electronic resources. Our
experience led us to realize that text mining rights of full-text
articles in XML format should routinely be included in the negotiation of
the library’s licenses.
Franganillo, J.
"Implicaciones éticas de la minería de
datos." Anuario ThinkEPI vol., n. (2010).
pp.:
http://www.thinkepi.net/implicaciones-eticas-de-la-mineria-de-datos
Certain experts can describe the behaviour of a group of people based on the
digital records of what they do. The description is detailed: what they do,
what they buy, how they work, who they relate to. This is data mining, which
is usually used to discriminate in a positive way: knowing, for example, the
buying habits of a given group makes it possible to target an advertising
campaign at them more effectively. But it can also be used to discriminate
negatively: analysing the e-mail logs of a company's employees makes it
possible to identify those who are feeding informal networks and, as a
consequence, managers might change their attitude towards them. One study
observes that people who buy red cars in France are more likely to default on
their loans (Chakrabarti, 2008): this could change the credit terms offered
to those who choose red for their car. People tend to be classified according
to stereotypes based on statistical correlations, but these carry the errors
of any generalization, and so some pay for others.
Fu Lee, W. and C. C. Yang
"Mining Web data for Chinese
segmentation." Journal of the American Society for
Information Science and Technology vol. 58, n. 12 (2007).
pp.:
http://www3.interscience.wiley.com/cgi-bin/jtoc/76501873/
Modern
information retrieval systems use keywords within documents as indexing
terms for search of relevant documents. As Chinese is an ideographic
character-based language, the words in the texts are not delimited by
white spaces. Indexing of Chinese documents is impossible without a
proper segmentation algorithm. Many Chinese segmentation algorithms have
been proposed in the past. Traditional segmentation algorithms cannot
operate without a large dictionary or a large corpus of training data.
Nowadays, the Web has become the largest corpus that is ideal for Chinese
segmentation. Although most search engines have problems in segmenting
texts into proper words, they maintain huge databases of documents and
frequencies of character sequences in the documents. Their databases are
important potential resources for segmentation. In this paper, we propose
a segmentation algorithm by mining Web data with the help of search
engines. On the other hand, the Romanized pinyin of Chinese language
indicates boundaries of words in the text. Our algorithm is the first to
utilize the Romanized pinyin for segmentation. It is the first unified
segmentation algorithm for the Chinese language from different
geographical areas, and it is also domain independent because of the
nature of the Web. Experiments have been conducted on the datasets of a
recent Chinese segmentation competition. The results show that our
algorithm outperforms the traditional algorithms in terms of precision
and recall. Moreover, our algorithm can effectively deal with the
problems of segmentation ambiguity, new word (unknown word) detection,
and stop words.
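A hedged sketch of the underlying idea (not the authors' algorithm): given
frequencies of character sequences, such as those a search engine could
report, pick the segmentation whose pieces are most probable. The frequencies
below are invented:

import math

freq = {"中": 500, "国": 400, "中国": 900, "人": 800, "民": 300, "人民": 700}
total = sum(freq.values())

def best_segmentation(s):
    # best[i] = (score, segmentation) for the prefix s[:i]
    best = [(0.0, [])] + [(-math.inf, []) for _ in s]
    for i in range(1, len(s) + 1):
        for j in range(i):
            piece = s[j:i]
            if piece in freq:
                score = best[j][0] + math.log(freq[piece] / total)
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[len(s)][1]

print(best_segmentation("中国人民"))   # ['中国', '人民']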
Fu, T., A. Abbasi, et al.
"A focused crawler for Dark Web
forums." Journal of the American Society for Information
Science and Technology vol. 61, n. 6 (2010). pp. 1213 -
1231.
http://doi.wiley.com/10.1002/asi.21323
The
unprecedented growth of the Internet has given rise to the Dark Web, the
problematic facet of the Web associated with cybercrime, hate, and
extremism. Despite the need for tools to collect and analyze Dark Web
forums, the covert nature of this part of the Internet makes traditional
Web crawling techniques insufficient for capturing such content. In this
study, we propose a novel crawling system designed to collect Dark Web
forum content. The system uses a human-assisted accessibility approach to
gain access to Dark Web forums. Several URL ordering features and
techniques enable efficient extraction of forum postings. The system also
includes an incremental crawler coupled with a recall-improvement
mechanism intended to facilitate enhanced retrieval and updating of
collected content. Experiments conducted to evaluate the effectiveness of
the human-assisted accessibility approach and the
recall-improvement-based, incremental-update procedure yielded favorable
results. The human-assisted approach significantly improved access to
Dark Web forums while the incremental crawler with recall improvement
also outperformed standard periodic- and incremental-update approaches.
Using the system, we were able to collect over 100 Dark Web forums from
three regions. A case study encompassing link and content analysis of
collected forums was used to illustrate the value and importance of
gathering and analyzing content from such online communities.
Gálvez, C.
"Minería de textos: la nueva generación de análisis de
literatura científica en biología molecular y genómica."
Departamento de Ciência da Informação, Universidade Federal de
Santa Catarina (Brasil) vol., n. (2008). pp.:
http://eprints.rclis.org/13361/
Now that the sequence of the human genome has been deciphered, the research
paradigm has shifted towards describing the functions of genes and towards
future advances in the fight against disease. This new context has sparked
interest in Bioinformatics, which combines methods from the Life Sciences and
the Information Sciences to make accessible the vast amount of biological
information stored in databases, and in Genomics, devoted to the study of
gene interactions and their influence on the development of diseases. In this
context, text mining emerges as a new instrument for the analysis of the
scientific literature. A common text mining task in Molecular Biology and
Genomics is the recognition of biological entities such as genes, proteins
and diseases. The next step in the mining process is the identification of
relationships between biological entities, such as the type of gene-gene,
gene-disease or gene-protein interaction, in order to interpret biological
functions or formulate research hypotheses. The aim of this work is to
examine the rise and the limitations of this new generation of tools for
analyzing natural-language information stored in bibliographic databases such
as PubMed or MEDLINE.
Gálvez, C. and F. Moya-Anegón
"Text-mining research in
genomics." International Association for Development of
the Information Society (IADIS) vol., n. (2008). pp.
277-283.
http://www.computing-conf.org/
Biomedical text-mining holds great promise for genomic researchers. The
goal of text-mining is to analyze large
collections of unstructured documents for the purposes of extracting
interesting and non-trivial patterns of knowledge. The analysis of
biomedical texts and available databases, such as Medline and PubMed, can
help to interpret a phenomenon, to detect gene relations, or to establish
comparisons among similar genes in different specific databases. All
these processes are crucial for making sense of the immense quantity of
genomic information. In genomics, text-mining research refers basically
to the creation of literature networks of related biological entities.
Text data represent the genomics knowledge base and can be mined for
relationships, literature networks, and new discoveries by literature
relational chaining. However, text-mining is an emerging field without a
clear definition in genomics. This work presents some applications of
text-mining to genome-based research, such as the genomic term
identification in curation processes, the formulation of hypotheses about
disease, the visualization of biological relationships, or the
life-science domain mapping.
Gómez Aguilar, D. A., F. J. García Peñalvo, et al.
"Analítica
visual en e-learning." Visual analytics in
e-learning vol. 23, n. 3 (2014). pp. 236-245.
http://elprofesionaldelainformacion.metapress.com/app/home/contribution.asp?referrer=parent&backto=issue,3,13;journal,2,96;homemainpublications,1,1
The technologies used in learning processes record all the activities carried
out. These data can be exploited to evaluate students, teachers and the
processes themselves. However, even though this large amount of data exists,
it is still difficult for teachers (and other stakeholders) to verify
hypotheses, draw conclusions or make decisions based on detected facts or
situations. A model for analyzing educational data based on visual analytics,
learning analytics and academic analytics is presented. Through a software
tool it allows exploratory and confirmatory data analysis, interacting with
the information obtained from a typical learning management system. The main
objective is the discovery of new knowledge about the educational learning
process that, in turn, makes it possible to improve it.
Haravu, L. J. and A. Neelameghan
"Text Mining and Data Mining in
Knowledge Organization and Discovery: The Making of Knowledge-Based
Products." Cataloging & classification
quarterly vol. 37, n. 1-2 (2003). pp.:
https://www.haworthpress.com/store/ArticleAbstract.asp?sid=82ATF0VJW1QK8MGFETSH163PXESHFAM9&ID=40765
Discusses the importance of knowledge organization in the context of the
information overload caused by the vast quantities of data and
information accessible on internal and external networks of an
organization. Defines the characteristics of a knowledge-based product.
Elaborates on the techniques and applications of text mining in
developing knowledge products. Presents two approaches, as case studies,
to the making of knowledge products: (1) steps and processes in the
planning, designing and development of a composite multilingual
multimedia CD product, with the potential international, inter-cultural
end users in view, and (2) application of natural language processing
software in text mining. Using a text mining software, it is possible to
link concept terms from a processed text to a related thesaurus,
glossary, schedules of a classification scheme, and facet structured
subject representations. Concludes that the products of text mining and
data mining could be made more useful if the features of a faceted scheme
for subject classification are incorporated into text mining techniques
and products.
He, Y. L. and S. C. Hui
"Mining a Web Citation Database for
Author Co-Citation Analysis." Information Processing &
Management vol. 38, n. 4 (2002). pp.:
http://www.sciencedirect.com/science/journal/03064573
Author
co-citation analysis (ACA) has been widely used in bibliometrics as an
analytical method in analyzing the intellectual structure of science
studies. It can be used to identify authors from the same or similar
research fields. However, such analysis method relies heavily on
statistical tools to perform the analysis and requires human
interpretation. The Web Citation Database is a data warehouse used for
storing citation indices of Web publications. In this paper, we propose a
mining process to automate the ACA based on the Web Citation Database.
The mining process uses agglomerative hierarchical clustering (AHC) as
the mining technique for author clustering and multidimensional scaling
(MDS) for displaying author cluster maps. The clustering results and
author cluster map have been incorporated into a citation-based retrieval
system known as PubSearch to support author retrieval of Web
publications.
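The two named techniques can be sketched with scikit-learn as follows; the
co-citation counts are invented and this is not the PubSearch
implementation:

# Agglomerative clustering of authors from a co-citation matrix, plus MDS to
# lay the authors out on a 2-D map (scikit-learn >= 1.2; older versions use
# affinity= instead of metric=).
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import MDS

authors = ["A", "B", "C", "D"]
cocitation = np.array([[0, 9, 1, 0],
                       [9, 0, 2, 1],
                       [1, 2, 0, 8],
                       [0, 1, 8, 0]])

# Turn co-citation counts into distances (more co-citation = closer).
dist = 1.0 / (1.0 + cocitation)
np.fill_diagonal(dist, 0.0)

labels = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                 linkage="average").fit_predict(dist)
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print(dict(zip(authors, labels)))   # author clusters
print(coords)                       # 2-D positions for the author map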
Heinrichs, J. H. and J.-S. Lim
"Integrating web-based data mining
tools with business models for knowledge management."
Decision Support Systems vol. 35, n. 1 (2003). pp.
103-112.
http://www.sciencedirect.com/science/article/B6V8S-45X0BS7-3/2/2134b5fcc7cf3c9d9ac56149e96489ef
As
firms begin to implement web-based presentation and data mining tools to
enhance decision support capability, the firm's knowledge workers must
determine how to most effectively use these new web-based tools to
deliver competitive advantage. The focus of this study is on evaluating
how knowledge workers integrate these tools into their information and
knowledge management requirements. The relationship between the
independent variables (web-based data mining software tools and business
models) and the dependent variable (strategic performance capabilities)
is empirically tested in this study. The results from this study
demonstrate the positive interaction effect between the tools and models
application on strategic performance capability.
Heneberg, P.
"Supposedly uncited articles of Nobel laureates and
Fields medalists can be prevalently attributed to the errors of omission
and commission." Journal of the American Society for
Information Science and Technology vol. 64, n. 3 (2013).
pp. 448-454.
http://dx.doi.org/10.1002/asi.22788
Several independent authors reported a high share of uncited
publications, which include those produced by top scientists. This share
was repeatedly reported to exceed 10% of the total papers produced,
without any explanation of this phenomenon and the lack of difference in
uncitedness between average and successful researchers. In this report,
we analyze the uncitedness among two independent groups of highly visible
scientists (mathematicians represented by Fields medalists, and
researchers in physiology or medicine represented by Nobel Prize
laureates in the respective field). Analysis of both groups led to the
identical conclusion: over 90% of the uncited database records of highly
visible scientists can be explained by the inclusion of editorial
materials, progress reports presented at international meetings (meeting
abstracts), discussion items (letters to the editor, discussion),
personalia (biographic items), and by errors of omission and commission
of the Web of Science (WoS) database and of the citing documents. Only a
marginal amount of original articles and reviews were found to be uncited
(0.9 and 0.3%, respectively), which is in strong contrast with the
previously reported data, which never addressed the document types among
the uncited records.
Hsinchun, C.
"Introduction to the JASIST Special Topic Section on
Web Retrieval and Mining A Machine Learning Perspective."
Journal of the American Society for Information Science and
Technology vol. 54, n. 7 (2002). pp.:
This
special issue consists of six papers that report research in web
retrieval and mining. Most papers apply or adapt various pre-web
retrieval and analysis techniques to other interesting and challenging
web-based applications.
Hsinchun Chen, M. C.
"Web mining: machine learning for web
applications." Annual Review of Information Science and
Technology (ARIST) vol. 38, n. (2004). pp.:
http://www3.interscience.wiley.com/cgi-bin/fulltext/111091572/PDFSTART
With
more than two billion pages created by millions of Web page authors and
organizations, the World Wide Web is a tremendously rich knowledge base.
The knowledge comes not only from the content of the pages themselves,
but also from the unique characteristics of the Web, such as its
hyperlink structure and its diversity of content and languages. Analysis
of these characteristics often reveals interesting patterns and new
knowledge. Such knowledge can be used to improve users' efficiency and
effectiveness in searching for information on the Web, and also for
applications unrelated to the Web, such as support for decision making or
business management.
Huang, C., T. Fu, et al.
"Text-based video content classification
for online video-sharing sites." Journal of the American
Society for Information Science and Technology vol. 61, n. 5
(2010). pp. 891-906.
http://dx.doi.org/10.1002/asi.21291
With
the emergence of Web 2.0, sharing personal content, communicating ideas,
and interacting with other online users in Web 2.0 communities have
become daily routines for online users. User-generated data from Web 2.0
sites provide rich personal information (e.g., personal preferences and
interests) and can be utilized to obtain insight about cyber communities
and their social networks. Many studies have focused on leveraging
user-generated information to analyze blogs and forums, but few studies
have applied this approach to video-sharing Web sites. In this study, we
propose a text-based framework for video content classification of
online-video sharing Web sites. Different types of user-generated data
(e.g., titles, descriptions, and comments) were used as proxies for
online videos, and three types of text features (lexical, syntactic, and
content-specific features) were extracted. Three feature-based
classification techniques (C4.5, Naïve Bayes, and Support Vector Machine)
were used to classify videos. To evaluate the proposed framework,
user-generated data from candidate videos, which were identified by
searching user-given keywords on YouTube, were first collected. Then, a
subset of the collected data was randomly selected and manually tagged by
users as our experiment data. The experimental results showed that the
proposed approach was able to classify online videos based on users'
interests with accuracy rates up to 87.2%, and all three types of text
features contributed to discriminating videos. Support Vector Machine
outperformed C4.5 and Naïve Bayes techniques in our experiments. In
addition, our case study further demonstrated that accurate
video-classification results are very useful for identifying implicit
cyber communities on video-sharing Web sites.
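As a rough illustration of this kind of text-based classification (not the
authors' implementation), the following Python sketch builds TF-IDF lexical
features from hypothetical user-generated text and compares a linear SVM
against Naïve Bayes, two of the three classifiers used in the paper; the
example videos and tags are invented.

# Minimal sketch: classify videos from the text users attach to them.
# Toy documents and labels are invented; only lexical (TF-IDF) features
# are used, not syntactic or content-specific ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

videos = [
    "guitar lesson beginner chords tutorial",
    "amazing goal highlights football match",
    "how to solder electronics diy repair",
    "top ten goals of the season soccer",
]
labels = ["music", "sports", "diy", "sports"]  # hypothetical user-given tags

for name, clf in [("SVM", LinearSVC()), ("NaiveBayes", MultinomialNB())]:
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(videos, labels)
    print(name, model.predict(["free kick goal compilation"]))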
Hwang, S.-Y., W.-S. Yang, et al.
"Automatic index construction
for multimedia digital libraries." Information Processing
& Management vol. 46, n. 3 (2010). pp. 295-307.
http://www.sciencedirect.com/science/article/B6VC8-4XM6NHT-2/2/50b0f024e70987516fe1fba5a5637955
Indexing remains one of the most popular tools provided by digital
libraries to help users identify and understand the characteristics of
the information they need. Despite extensive studies of the problem of
automatic index construction for text-based digital libraries, the
construction of multimedia digital libraries continues to represent a
challenge, because multimedia objects usually lack sufficient text
information to ensure reliable index learning. This research attempts to
tackle the problem of automatic index construction for multimedia objects
by employing Web usage logs and limited keywords pertaining to multimedia
objects. The tests of two proposed algorithms use two different data sets
with different amounts of textual information. Web usage logs offer
precious information for building indexes of multimedia digital libraries
with limited textual information. The proposed methods generally yield
better indexes, especially for the artwork data set.
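A minimal sketch of the general idea, not of the two algorithms proposed in
the paper: keywords known for a few objects are propagated to other
multimedia objects that co-occur with them in the same usage sessions,
weighted by co-usage frequency. Sessions and keywords below are invented.

# Propagate scarce keywords between multimedia objects that co-occur in
# the same Web usage sessions; weights reflect co-usage frequency.
from collections import Counter, defaultdict
from itertools import combinations

sessions = [["img1", "img2"], ["img2", "img3"], ["img1", "img3", "img2"]]
known_keywords = {"img1": {"impressionism"}, "img3": {"landscape"}}

cooc = defaultdict(Counter)
for session in sessions:
    for a, b in combinations(set(session), 2):
        cooc[a][b] += 1
        cooc[b][a] += 1

index = defaultdict(Counter)
for obj, neighbours in cooc.items():
    for other, weight in neighbours.items():
        for kw in known_keywords.get(other, ()):
            index[obj][kw] += weight  # keyword inherited from a co-used object

for obj, kws in sorted(index.items()):
    print(obj, kws.most_common())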
Jia, L., Z. Pengzhu, et al.
"External concept support for group
support systems through Web mining." Journal of the
American Society for Information Science and Technology vol. 60,
n. 5 (2009). pp. 1057.
http://proquest.umi.com/pqdweb?did=1682801381&Fmt=7&clientId=40776&RQT=309&VName=PQD
External information plays an important role in group decision-making
processes, yet research about external information support for Group
Support Systems (GSS) has been lacking. In this study, we propose an
approach to build a concept space to provide external concept support for
GSS users. Built on a Web mining algorithm, the approach can mine a
concept space from the Web and retrieve related concepts from the concept
space based on users' comments in a real-time manner. We conduct two
experiments to evaluate the quality of the proposed approach and the
effectiveness of the external concept support provided by this approach.
The experiment results indicate that the concept space mined from the Web
contained qualified concepts to stimulate divergent thinking. The results
also demonstrate that external concept support in GSS greatly enhanced
group productivity for idea generation tasks.
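The following toy sketch, with invented snippets, illustrates the flavour of
such a concept space (it is not the authors' Web mining algorithm): concepts
that co-occur in mined text are linked, and concepts related to a user's
comment are looked up in real time.

# Build a tiny co-occurrence concept space and retrieve concepts related
# to a user's comment; the snippet corpus is invented.
from collections import Counter, defaultdict
from itertools import combinations

snippets = [
    "renewable energy solar panels storage",
    "solar panels subsidy policy",
    "energy storage battery cost",
]

space = defaultdict(Counter)
for snippet in snippets:
    for a, b in combinations(set(snippet.split()), 2):
        space[a][b] += 1
        space[b][a] += 1

def related_concepts(comment, k=3):
    scores = Counter()
    for term in comment.lower().split():
        scores.update(space.get(term, {}))
    return [concept for concept, _ in scores.most_common(k)]

print(related_concepts("we should invest in solar energy"))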
Jiang, X. and A.-H. Tan
"CRCTOL: A semantic-based domain ontology
learning system." Journal of the American Society for
Information Science and Technology vol., n. (2009). pp.:
http://dx.doi.org/10.1002%2Fasi.21231
Domain
ontologies play an important role in supporting knowledge-based
applications in the Semantic Web. To facilitate the building of
ontologies, text mining techniques have been used to perform ontology
learning from texts. However, traditional systems employ shallow natural
language processing techniques and focus only on concept and taxonomic
relation extraction. In this paper we present a system, known as
Concept-Relation-Concept Tuple-based Ontology Learning (CRCTOL), for
mining ontologies automatically from domain-specific documents.
Specifically, CRCTOL adopts a full text parsing technique and employs a
combination of statistical and lexico-syntactic methods, including a
statistical algorithm that extracts key concepts from a document
collection, a word sense disambiguation algorithm that disambiguates
words in the key concepts, a rule-based algorithm that extracts relations
between the key concepts, and a modified generalized association rule
mining algorithm that prunes unimportant relations for ontology learning.
As a result, the ontologies learned by CRCTOL are more concise and
contain richer semantics in terms of the range and number of semantic
relations compared with alternative systems. We present two case studies
where CRCTOL is used to build a terrorism domain ontology and a sport
event domain ontology. At the component level, quantitative evaluation by
comparing with Text-To-Onto and its successor Text2Onto has shown that
CRCTOL is able to extract concepts and semantic relations with a
significantly higher level of accuracy. At the ontology level, the
quality of the learned ontologies is evaluated by either employing a set
of quantitative and qualitative methods including analyzing the graph
structural property, comparison to WordNet, and expert rating, or
directly comparing with a human-edited benchmark ontology, demonstrating
the high quality of the ontologies learned.
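As a hedged sketch of only the first CRCTOL step (key concept extraction),
the fragment below ranks candidate terms by how much more frequent they are
in a domain corpus than in a general contrast corpus; the scoring rule and
the two tiny corpora are invented, not the paper's statistical algorithm.

# Rank candidate domain concepts by relative frequency against a
# contrast corpus; corpora and scoring are illustrative only.
from collections import Counter

domain_docs = ["bomb attack casualties terrorist group claimed attack",
               "hostage rescue operation terrorist cell"]
general_docs = ["the weather was mild and the group enjoyed the trip",
                "operation of the new metro line started on time"]

def relative_freqs(docs):
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

dom, gen = relative_freqs(domain_docs), relative_freqs(general_docs)
scores = {word: f / (gen.get(word, 0.0) + 1e-6) for word, f in dom.items()}
print(sorted(scores, key=scores.get, reverse=True)[:5])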
Jiann-Cherng, S.
"The integration system for librarians'
bibliomining." The Electronic Library vol. 28, n.
5 (2010). pp. 709-721.
http://dx.doi.org/10.1108/02640471011081988
Purpose – For library service, bibliomining is concisely defined as the
data mining techniques used to extract patterns of behavior-based
artifacts from library systems. The bibliomining process includes
identifying topics, creating a data warehouse, refining data, exploring
data and evaluating results. Practical implementations and applications
in different areas have shown that a sufficiently complete and
consolidated data warehouse is critical to successful data mining
applications. However, creating a data warehouse from various data
sources clearly hampers librarians in applying bibliomining to improve
their services and operations. Moreover, most commercial data mining
tools are too complex for librarians to adopt for bibliomining. The
purpose of this paper is to propose a practical
application model for librarian bibliomining, then develop its
corresponding data processing prototype system to guarantee the success
of applying data mining in libraries. Design/methodology/approach – The
rapid prototyping software development method was applied to design a
prototype bibliomining system. In order to evaluate the effectiveness of
the system, there was a comparison experiment of accomplishing an
assigned task for 15 librarians. Findings – With the results of system
usability scale (SUS) comparison and turn-around time analysis, it was
established that the proposed model and the developed prototype system
can really help librarians handle bibliomining applications better.
Originality/value – The proposed bibliomining application model and its
integration system were shown to be effective and efficient through the
task-oriented experiment and the SUS ratings of the 15 librarians.
Comparing turn-around times for the assigned task, about 35 per cent of
the time was saved. Librarians require an appropriate integration tool to
assist them in successful bibliomining applications.
Kai, G., W. Yong-Cheng, et al.
"Similar interest clustering and
partial back-propagation-based recommendation in digital
library." Library Hi Tech vol. 23, n. 4
(2005). pp.:
http://www.emeraldinsight.com/10.1108/07378830510636364
The
purpose of this paper is to propose a recommendation approach for
information retrieval. Design/methodology/approach – Relevant results are
presented on the basis of a novel data structure named FPT-tree, which is
used to get common interests. Then, data is trained by using a partial
back-propagation neural network. The learning is guided by users' click
behaviors. Findings – Experimental results have shown the effectiveness
of the approach. Originality/value – The approach attempts to integrate
metric of interests (e.g., click behavior, ranking) into the strategy of
the recommendation system. Relevant results are first presented on the
basis of a novel data structure named FPT-tree, and then, those results
are trained through a partial back-propagation neural network. The
learning is guided by users' click behaviors.
Kao, S. C., H. C. Chang, et al.
"Decision support for the
academic library acquisition budget allocation via circulation database
mining." Information Processing & Management
vol. 39, n. 1 (2003). pp.:
http://www.sciencedirect.com/science/journal/03064573
Many
approaches to decision support for the academic library acquisition
budget allocation have been proposed to diversely reflect the management
requirements. Different from these methods that focus mainly on either
statistical analysis or goal programming, this paper introduces a model
(ABAMDM, acquisition budget allocation model via data mining) that
addresses the use of descriptive knowledge discovered in the historical
circulation data explicitly to support allocating library acquisition
budget. The major concern in this study is that the budget allocation
should be able to reflect a requirement that the more a department makes
use of its acquired materials in the present academic year, the more it
can get budget for the coming year. The primary output of the ABAMDM used
to derive weights of acquisition budget allocation contains two parts.
One is the descriptive knowledge via utilization concentration and the
other is the suitability via utilization connection for departments
concerned. An application to the library of Kun Shan University of
Technology was described to demonstrate the introduced ABAMDM in
practice.
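A minimal arithmetic sketch of the allocation principle described above (not
the ABAMDM model itself): each department's share of next year's budget is
proportional to how heavily its acquired materials circulated this year. All
figures are invented.

# Allocate a hypothetical acquisition budget in proportion to each
# department's circulation of its acquired materials.
circulation = {"Engineering": 4200, "Business": 3100, "Nursing": 1700}
total_budget = 500_000  # hypothetical budget for the coming year

total_use = sum(circulation.values())
allocation = {dept: total_budget * loans / total_use
              for dept, loans in circulation.items()}
for dept, amount in allocation.items():
    print(f"{dept}: {amount:,.0f}")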
Kostoff, R. N.
"Literature-Related Discovery."
Annual Review of Information Science and Technology (ARIST)
vol. 43, n. (2009). pp. 241-287.
http://onlinelibrary.wiley.com/doi/10.1002/aris.144.v43:1/issuetoc
Literature-related discovery (LRD) is linking two or more literature
concepts that have heretofore not been linked (i.e., disjoint), in order
to produce novel, interesting, plausible, and intelligible knowledge. LRD
has two components: Literature-based discovery (LBD) generates potential
discovery through literature analysis alone, whereas literature-assisted
discovery (LAD) generates potential discovery through a combination of
literature analysis and interactions among selected literature authors.
In turn, there are two types of LBD and LAD: open discovery systems
(ODS), where one starts with a problem and arrives at a solution, and
closed discovery systems (CDS), where one starts with a problem and a
solution, then determines the mechanism(s) that links them.
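A toy sketch of the open-discovery (ODS) chaining idea, with invented
mini-literatures: starting from a problem literature A, other literatures C
that share intermediate concepts B with it become candidate links. This only
illustrates the A-B-C pattern, not Kostoff's methodology.

# Candidate discoveries are literatures that share intermediate (B)
# terms with the problem (A) literature; term sets are invented.
literature = {
    "raynaud":  {"blood viscosity", "platelet aggregation"},
    "fish oil": {"blood viscosity", "platelet aggregation", "triglycerides"},
    "statins":  {"triglycerides", "cholesterol"},
}
problem = "raynaud"

b_terms = literature[problem]
candidates = {topic for topic, terms in literature.items()
              if topic != problem and terms & b_terms}
print("candidate linked literatures:", candidates)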
Ku, L.-W., H.-W. Ho, et al.
"Opinion mining and relationship
discovery using CopeOpi opinion analysis system." Journal
of the American Society for Information Science and Technology
vol. 60, n. 7 (2009). pp. 1486-1503.
http://dx.doi.org/10.1002/asi.21067
We
present CopeOpi, an opinion-analysis system, which extracts from the Web
opinions about specific targets, summarizes the polarity and strength of
these opinions, and tracks opinion variations over time. Objects that
yield similar opinion tendencies over a certain time period may be
correlated due to the latent causal events. CopeOpi discovers
relationships among objects based on their opinion-tracking plots and
collocations. Event bursts are detected from the tracking plots, and the
strength of opinion relationships is determined by the coverage of these
plots. To evaluate opinion mining, we use the NTCIR corpus annotated with
opinion information at sentence and document levels. CopeOpi achieves
sentence- and document-level f-measures of 62% and 74%. For relationship
discovery, we collected 1.3M economics-related documents from 93 Web
sources over 22 months, and analyzed collocation-based, opinion-based,
and hybrid models. We consider company pairs that demonstrate similar
stock-price variations to be correlated, and selected these as the gold
standard for evaluation. Results show that opinion-based and
collocation-based models complement each other, and that integrated
models perform the best. The top 25, 50, and 100 pairs discovered achieve
precision rates of 1, 0.92, and 0.79, respectively.
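The relationship-discovery step can be pictured with the following sketch,
which correlates two invented opinion-tracking plots (average polarity per
period); CopeOpi itself also uses collocations and event-burst detection,
which are omitted here.

# Score object pairs as related when their opinion-tracking plots move
# together; uses Pearson correlation (Python 3.10+), data invented.
from statistics import correlation

opinion_plot = {
    "CompanyA": [0.2, 0.4, -0.1, 0.5, 0.3],
    "CompanyB": [0.1, 0.5, -0.2, 0.4, 0.2],
    "CompanyC": [-0.3, -0.1, 0.4, -0.5, 0.0],
}

for x, y in [("CompanyA", "CompanyB"), ("CompanyA", "CompanyC")]:
    r = correlation(opinion_plot[x], opinion_plot[y])
    print(f"{x} vs {y}: r = {r:.2f}")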
Lai, Y. and J. Zeng
"A cross-language personalized recommendation
model in digital libraries." The Electronic Library
vol. 31, n. 3 (2013). pp. 264-277.
http://dx.doi.org/10.1108/EL-08-2011-0126
Purpose – The purpose of this paper is to develop a cross-language
personalized recommendation model based on web log mining, which can
recommend academic articles, in different languages, to users according
to their demands. Design/methodology/approach – The proposed model takes
advantage of web log data archived in digital libraries and learns user
profiles by means of integration analysis of a user's multiple online
behaviors. Moreover, keyword translation was carried out to eliminate
language dissimilarity between user and item profiles. Finally, article
recommendation can be achieved using various existing algorithms.
Findings – The proposed model can recommend articles in different
languages to users according to their demands, and the integration
analysis of multiple online behaviors can help to better understand a
user's interests. Practical implications – This study has a significant
implication for digital libraries in non-English countries, since English
is the most popular language in current academic articles, and it is very
common for users in these countries to obtain literature presented in
more than one language. Furthermore, this approach is also
useful for other text-based item recommendation systems.
Originality/value – A lot of research work has been done in the
personalized recommendation area, but few works have discussed the
recommendation problem under multiple linguistic circumstances. This
paper deals with cross-language recommendation and, moreover, the
proposed model puts forward an integration analysis method based on
multiple online behaviors to understand users' interests, which can
provide references for other recommendation systems in the digital
age.
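A hedged sketch of the cross-language matching idea: keywords mined from a
user's web logs are translated with a toy dictionary to bridge the language
gap and then matched against article texts by cosine similarity. The
dictionary, logged keywords, and articles are all invented; the paper's
model integrates several online behaviours that are omitted here.

# Translate profile keywords, then rank articles by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

zh_en = {"数据挖掘": "data mining", "图书馆": "library"}  # toy translation table
logged_keywords = ["数据挖掘", "图书馆", "recommendation"]

profile = " ".join(zh_en.get(k, k) for k in logged_keywords)
articles = ["data mining techniques for digital library logs",
            "protein folding simulation methods"]

matrix = TfidfVectorizer().fit_transform([profile] + articles)
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
for title, score in sorted(zip(articles, scores), key=lambda t: -t[1]):
    print(f"{score:.2f}  {title}")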
Lappas, G.
"An overview of web mining in societal benefit
areas." Online Information Review vol. 32, n. 2
(2008). pp. 179-195.
http://ejournals.ebsco.com/direct.asp?ArticleID=4946A99CCC163D629265
Purpose - The focus of this paper is a survey of web-mining research
related to areas of societal benefit. The article aims to focus
particularly on web mining which may benefit societal areas by extracting
new knowledge, providing support for decision making and empowering the
effective management of societal issues. Design/methodology/approach -
E-commerce and e-business are two fields that have been empowered by web
mining, having many applications for increasing online sales and doing
intelligent business. Have areas of social interest also been empowered
by web mining applications? What are the current ongoing research and
trends in e-services fields such as e-learning, e-government, e-politics
and e-democracy? What other areas of social interest can benefit from web
mining? This work will try to provide the answers by reviewing the
literature for the applications and methods applied to the above fields.
Findings - There is a growing interest in applications of web mining that
are of social interest. This reveals that one of the current trends of
web mining is toward the connection between intelligent web services and
societal benefit applications, which denotes the need for
interdisciplinary collaboration between researchers from various fields.
Originality/value - On the one hand, this work presents to the web-mining
community an overview of research opportunities in societal benefit
areas. On the other hand, it presents to web researchers from various
disciplines an approach for improving their web studies by considering
web mining as a powerful research tool.
Liu, X., S. Yu, et al.
"Weighted hybrid clustering by combining
text mining and bibliometrics on a large-scale journal
database." Journal of the American Society for Information
Science and Technology vol. 61, n. 6 (2010). pp. 1105 -
1119.
http://doi.wiley.com/10.1002/asi.21312
We
propose a new hybrid clustering framework to incorporate text mining with
bibliometrics in journal set analysis. The framework integrates two
different approaches: clustering ensemble and kernel-fusion clustering.
To improve the flexibility and the efficiency of processing large-scale
data, we propose an information-based weighting scheme to leverage the
effect of multiple data sources in hybrid clustering. Three different
algorithms are extended by the proposed weighting scheme and they are
employed on a large journal set retrieved from the Web of Science (WoS)
database. The clustering performance of the proposed algorithms is
systematically evaluated using multiple evaluation methods, and they were
cross-compared with alternative methods. Experimental results demonstrate
that the proposed weighted hybrid clustering strategy is superior to
other methods in clustering performance and efficiency. The proposed
approach also provides a more refined structural mapping of journal sets,
which is useful for monitoring and detecting new trends in different
scientific fields.
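The kernel-fusion half of such a framework can be sketched as below: a
text-similarity kernel and a bibliometric kernel are blended with weights
and the fused matrix is clustered. The weights here are fixed by hand,
whereas the paper derives them from an information-based scheme, and the
small matrices are invented.

# Fuse two similarity kernels and cluster the result spectrally.
import numpy as np
from sklearn.cluster import SpectralClustering

text_kernel = np.array([[1.0, 0.8, 0.1, 0.2],
                        [0.8, 1.0, 0.2, 0.1],
                        [0.1, 0.2, 1.0, 0.9],
                        [0.2, 0.1, 0.9, 1.0]])
cite_kernel = np.array([[1.0, 0.6, 0.0, 0.1],
                        [0.6, 1.0, 0.1, 0.0],
                        [0.0, 0.1, 1.0, 0.7],
                        [0.1, 0.0, 0.7, 1.0]])

w_text, w_cite = 0.6, 0.4  # hand-set weights, not the paper's scheme
fused = w_text * text_kernel + w_cite * cite_kernel

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(fused)
print(labels)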
Liwen, V., Y. Rongbin, et al.
"Web co-word analysis for business
intelligence in the Chinese environment." Aslib
Proceedings vol. 64, n. 6 (2012). pp. 653-667.
http://dx.doi.org/10.1108/00012531211281788
Purpose – The study seeks to apply Web co-word analysis to the Chinese
business environment to test the feasibility of the method there.
Design/methodology/approach – The authors selected a group of companies
in two Chinese industries, collected co-word data for the companies,
analyzed the data with multidimensional scaling (MDS), and then compared
the MDS maps generated from the co-word data with business situations to
find out if the co-word method works. Findings – The study found that the
Web co-word method could potentially be applied to the Chinese
environment. The study also found the advantages and disadvantages of the
Web co-word method vs the Web co-link method. Originality/value – Knowing
the applicability of the Web co-word method to the Chinese environment
contributes to the knowledge of this new Webometrics method. Mining
business information from the Web is more valuable when applied to a
foreign country where language and culture barriers exist. To use the
co-word method, one does not have to be able to read or write in that
language. One only needs to have the names of the companies to study,
which can be easily obtained without knowledge of the language. The value
of business information about countries such as China is obvious given
the global nature of contemporary business competition and the
significance of the Chinese economy to the world.
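The mapping step of a co-word study can be sketched as follows: hypothetical
co-occurrence counts between company names are converted to distances and
laid out in two dimensions with multidimensional scaling; collecting the
counts from the Web is left out, and all numbers are invented.

# Turn a co-word count matrix into an MDS map of companies.
import numpy as np
from sklearn.manifold import MDS

companies = ["CompA", "CompB", "CompC"]
cooc = np.array([[0, 120, 15],
                 [120, 0, 20],
                 [15, 20, 0]], dtype=float)

distance = 1.0 - cooc / cooc.max()
np.fill_diagonal(distance, 0.0)

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(distance)
for name, (x, y) in zip(companies, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")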
Löfström, T. and U. Johansson
"Predicting the Benefit of Rule
Extraction: A Novel Component in Data Mining." Human
IT vol. 7, n. 3 (2002). pp.:
http://www.hb.se/bhs/ith/3-7/tluj.pdf
Lun-Wei, K. and C. Hsin-Hsi
"Mining opinions from the Web: Beyond
relevance retrieval." Journal of the American Society for
Information Science and Technology vol. 58, n. 12 (2007).
pp.:
http://www3.interscience.wiley.com/cgi-bin/jtoc/76501873/
Documents discussing public affairs, common themes, interesting products,
and so on, are reported and distributed on the Web. Positive and negative
opinions embedded in documents are useful references and feedbacks for
governments to improve their services, for companies to market their
products, and for customers to purchase their objects. Web opinion mining
aims to extract, summarize, and track various aspects of subjective
information on the Web. Mining subjective information enables traditional
information retrieval (IR) systems to retrieve more data from human
viewpoints and provide information with finer granularity. Opinion
extraction identifies opinion holders, extracts the relevant opinion
sentences, and decides their polarities. Opinion summarization recognizes
the major events embedded in documents and summarizes the supportive and
the nonsupportive evidence. Opinion tracking captures subjective
information from various genres and monitors the developments of opinions
from spatial and temporal dimensions. To demonstrate and evaluate the
proposed opinion mining algorithms, news and bloggers' articles are
adopted. Documents in the evaluation corpora are tagged in different
granularities from words, sentences to documents. In the experiments,
positive and negative sentiment words and their weights are mined on the
basis of Chinese word structures. The f-measures are 73.18% and 63.75%
for verbs and nouns, respectively. Utilizing the mined sentiment words
together with topical words, we achieve f-measures of 62.16% at the
sentence level and 74.37% at the document level.
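A minimal sketch of weighted-lexicon polarity scoring at the sentence and
document level; the lexicon and its weights are invented here, whereas the
paper mines them from Chinese word structures.

# Score sentence and document polarity with a weighted sentiment lexicon.
import re

lexicon = {"excellent": 0.9, "good": 0.5, "poor": -0.6, "terrible": -0.9}

def polarity(text):
    words = re.findall(r"[a-z]+", text.lower())
    return sum(lexicon.get(w, 0.0) for w in words)

document = ["The camera is excellent and the battery life is good.",
            "The manual is terrible."]
sentence_scores = [polarity(s) for s in document]
print("sentence scores:", sentence_scores)
print("document score:", sum(sentence_scores))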
MacMillan, M.
"Mining E-mail to Improve Information Literacy
Instruction." Evidence Based Library & Information
Practice vol. 5, n. 2 (2010). pp. 103-106.
http://ejournals.library.ualberta.ca/index.php/EBLIP/article/viewFile/7996/6968
The
article discusses a study which described how an academic librarian mined
the number and type of e-mail questions sent by students and how this
strategy resulted in an improvement of information literacy instruction.
A background on Mount Royal University (MRU) in Calgary, Alberta is
offered. Data collection and extraction from August 2008 to July 2009 are
described. A discussion of the implementation of changes to information
literacy instruction delivery, based on the research findings, during the
2009-2010 academic year is detailed.
Marrero, M., S. Sánchez-Cuadrado, et al.
"Sistemas de
recuperación de información adaptados al dominio biomédico."
El Profesional de la Información vol. 19, n. 3
(2010). pp. 246-254.
http://elprofesionaldelainformacion.metapress.com/media/7pab6qgbap6tpkecbp6j/contributions/u/4/8/0/u480m8g27l202736.pdf
The terminology used in biomedicine
has lexical characteristics that have required the elaboration of
terminological resources and information retrieval systems with specific
functionalities. The main characteristics are the high rates of synonymy
and homonymy, due to phenomena such as the proliferation of polysemic
acronyms and their interaction with common language. Information
retrieval systems in the biomedical domain use techniques oriented to the
treatment of these lexical peculiarities. In this paper we review some of
these techniques, such as the application of Natural Language Processing
(BioNLP), the incorporation of lexical-semantic resources, and the
application of Named Entity Recognition (BioNER). Finally, we present the
evaluation methods adopted to assess the suitability of these techniques
for retrieving biomedical resources.
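To make the BioNER idea concrete, here is a toy dictionary-based recognizer
with a naive acronym-disambiguation rule driven by nearby context words; the
lexicon, the acronym senses, and the cue words are all invented for
illustration.

# Dictionary-based entity tagging with context-cue acronym disambiguation.
lexicon = {"aspirin": "DRUG", "p53": "GENE"}
acronym_senses = {"CAT": {"scan": "PROCEDURE", "gene": "GENE"}}

def recognize(sentence):
    tokens = sentence.split()
    entities = []
    for i, tok in enumerate(tokens):
        if tok.lower() in lexicon:
            entities.append((tok, lexicon[tok.lower()]))
        elif tok in acronym_senses:
            context = {t.lower() for t in tokens[max(0, i - 3):i + 4]}
            for cue, label in acronym_senses[tok].items():
                if cue in context:
                    entities.append((tok, label))
    return entities

print(recognize("The CAT scan confirmed the lesion before aspirin was given"))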
Martínez Méndez, F. J. and R. López Carreño
"Análisis prospectivo
de las tendencias de desarrollo de los portales periodísticos
españoles." Scire: Representación y organización del
conocimiento vol. 11, n. 2 (2005). pp. 33-62.
http://ibersid.eu/ojs/index.php/scire/article/view/1520/1498
The taxonomic study of the structure of news portals and the analysis of
the frequency of appearance of their components (informative products,
documentary products, and value-added services) have established the
current state of the art regarding their development. Data mining is a
technology that makes it possible to identify the trends and patterns
followed in that development, starting from the knowledge base provided
by the work previously carried out and allowing supplementary information
to be retrieved beyond what has already been made explicit.
Source: Facultad de Traducción, Universidad de Salamanca.